[jira] [Commented] (ARROW-14939) [R] Problem with new variables in dataset schema

Neal Richardson (Jira) Wed, 01 Dec 2021 11:27:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451987#comment-17451987
 ]


Neal Richardson commented on ARROW-14939:
-----------------------------------------

I think you want the {{unify_schemas = TRUE}} argument to {{open_dataset()}}. 
It is FALSE by default because it has to inspect every file, which can be slow 
depending on how many files you have, but it sounds like that is what you want 
in your case. Alternatively, if you know it, you can provide an explicit 
schema that covers all of the files.
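
As a rough sketch of the two options, under some assumptions: the temporary directory below is a hypothetical stand-in for the reporter's data folder, and the column types in the explicit schema are assumed to be float64 because the example data frames hold plain R numerics.

```r
library(arrow)

# Hypothetical temporary directory standing in for the reporter's data folder.
data_dir <- tempfile("arrow_14939_")
dir.create(data_dir)

# Two Parquet files in the same directory; only the second has column c.
write_parquet(data.frame(a = 1, b = 2), file.path(data_dir, "df1.parquet"))
write_parquet(data.frame(a = 2, b = 3, c = 4), file.path(data_dir, "df2.parquet"))

# Option 1: inspect every file and merge the schemas that are found.
ds_unified <- open_dataset(data_dir, unify_schemas = TRUE)

# Option 2: supply a schema covering all files (types are an assumption here).
full_schema <- schema(a = float64(), b = float64(), c = float64())
ds_explicit <- open_dataset(data_dir, schema = full_schema)

# Column names seen by each dataset.
unified_names <- ds_unified$schema$names
explicit_names <- ds_explicit$schema$names
print(unified_names)
print(explicit_names)
```

With the explicit schema no extra file inspection is needed, which may matter when the directory holds many files; the unified variant is more convenient when the full set of columns is not known in advance.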

> [R] Problem with new variables in dataset schema
> ------------------------------------------------
>
>                 Key: ARROW-14939
>                 URL: https://issues.apache.org/jira/browse/ARROW-14939
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 6.0.1
>         Environment: RStudio Version
> --------------------------------------------------
> 1.4.1717
> Session Information
> --------------------------------------------------
> R version 4.1.0 (2021-05-18)
> Platform: x86_64-apple-darwin17.0 (64-bit)
> Running under: macOS 12.0.1
> Matrix products: default
> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
> [1] arrow_6.0.1
> loaded via a namespace (and not attached):
>  [1] tidyselect_1.1.1 bit_4.0.4        compiler_4.1.0   magrittr_2.0.1   assertthat_0.2.1 R6_2.5.1
>  [7] tools_4.1.0      glue_1.5.0       bit64_4.0.5      vctrs_0.3.8      rlang_0.4.12     purrr_0.3.4
> System Information
> --------------------------------------------------
> sysname        : Darwin
> release        : 21.1.0
> version        : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
> nodename       :
> machine        : x86_64
> login          : root
> user           : os
> effective_user : os
> Platform Information
> --------------------------------------------------
> OS.type    : unix
> file.sep   : /
> dynlib.ext : .so
> GUI        : RStudio
> endian     : little
> pkgType    : mac.binary
> path.sep   : :
> r_arch     : 
>            Reporter: Pal
>            Priority: Critical
>
> Hi,
> I have a problem with updating the schema in arrow::open_dataset().
> For example, let's say I have one Parquet file with two columns (a and b) and 
> another file with three columns (a, b and c). When I open this dataset, its 
> schema only contains columns a and b. Am I missing something? In my previous 
> experience, when I added new columns to some Parquet files that did not exist 
> in other files, the new columns were automatically added to my schema, which 
> was great.
> Below is the code to replicate my issue:
>  
> {code:java}
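> library(arrow)
> # Two Parquet files in the same directory; only the second has column c.
> df <- data.frame(a = 1, b = 2)
> df_2 <- data.frame(a = 2, b = 3, c = 4)
> write_parquet(df, "C:/Data/test2/df1.parquet")
> write_parquet(df_2, "C:/Data/test2/df2.parquet")
> # Open the directory as one dataset and list the columns in its schema.
> ds <- arrow::open_dataset(sources = "C:/Data/test2")
> ds_cols <- data.frame(variables = ds$schema$names)
> ds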
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
