Pal created ARROW-14939:
---------------------------

             Summary: [R] Problem with new variables in dataset schema
                 Key: ARROW-14939
                 URL: https://issues.apache.org/jira/browse/ARROW-14939
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 6.0.1
         Environment: 
RStudio Version
--------------------------------------------------
1.4.1717


Session Information
--------------------------------------------------
R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS 12.0.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_6.0.1

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1 bit_4.0.4        compiler_4.1.0   magrittr_2.0.1   assertthat_0.2.1 R6_2.5.1
 [7] tools_4.1.0      glue_1.5.0       bit64_4.0.5      vctrs_0.3.8      rlang_0.4.12     purrr_0.3.4


System Information
--------------------------------------------------
sysname        : Darwin
release        : 21.1.0
version        : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
nodename       :
machine        : x86_64
login          : root
user           : os
effective_user : os


Platform Information
--------------------------------------------------
OS.type    : unix
file.sep   : /
dynlib.ext : .so
GUI        : RStudio
endian     : little
pkgType    : mac.binary
path.sep   : :
r_arch     : 
            Reporter: Pal


Hi, 

I have a problem with updating the schema in arrow::open_dataset().

For example, say I have one Parquet file with two columns (a and b) and 
another file with three columns (a, b and c). When I open this dataset, its 
schema only contains columns a and b. Am I missing something? In my previous 
experience, when I added new columns to some Parquet files that did not exist 
in other files, the new columns were automatically added to the dataset 
schema, which was great.

Below is the code to reproduce my issue:

 
{code:java}
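library(arrow)

# First file: two columns (a and b)
df <- data.frame(a = 1,
                 b = 2)
# Second file: adds a third column (c)
df_2 <- data.frame(a = 2,
                   b = 3,
                   c = 4)
write_parquet(df, "C:/Data/test2/df1.parquet")
write_parquet(df_2, "C:/Data/test2/df2.parquet")
# Open both files as one dataset and list the columns in its schema
ds <- arrow::open_dataset(sources = "C:/Data/test2")
ds_cols <- data.frame(variables = ds$schema$names)
ds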
{code}
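
For comparison, here is a minimal sketch of the same setup with the `unify_schemas` argument of `open_dataset()` set to `TRUE`. As far as I understand the documentation, this asks arrow to scan every file when building the schema instead of inspecting only the first one. The temporary directory is a hypothetical stand-in for "C:/Data/test2", and I am assuming this is the relevant setting rather than claiming it is the cause of the behaviour above.

```r
library(arrow)

# Hypothetical temporary directory standing in for "C:/Data/test2"
data_dir <- file.path(tempdir(), "test2")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

# Same two files as in the report: the second one has an extra column c
write_parquet(data.frame(a = 1, b = 2), file.path(data_dir, "df1.parquet"))
write_parquet(data.frame(a = 2, b = 3, c = 4), file.path(data_dir, "df2.parquet"))

# Ask arrow to inspect all files and merge their schemas
ds_unified <- open_dataset(sources = data_dir, unify_schemas = TRUE)
unified_cols <- ds_unified$schema$names
print(unified_cols)
```

If this lists column c as well, the difference would come from how many files are inspected for schema inference, not from the Parquet files themselves.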
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
