Carl Boettiger created ARROW-18114:
--------------------------------------

             Summary: [R] unify_schemas=FALSE does not improve open_dataset() read times
                 Key: ARROW-18114
                 URL: https://issues.apache.org/jira/browse/ARROW-18114
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Carl Boettiger


open_dataset() provides the very helpful optional argument to set 
unify_schemas=FALSE, which should allow arrow to inspect a single parquet file 
instead of touching potentially thousands or more parquet files to determine a 
consistent unified schema.  This ought to provide a substantial performance 
increase in contexts where the schema is known in advance. 

Unfortunately, in my tests it seems to have no impact on performance.  Consider 
the following reprexes:

default, unify_schemas=TRUE:

library(arrow)
ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                endpoint_override = "data.ecoforecast.org", anonymous = TRUE)

bench::bench_time({
  open_dataset(ex)
})

This takes about 32 seconds for me.

manual, unify_schemas=FALSE:

bench::bench_time({
  open_dataset(ex, unify_schemas = FALSE)
})

This takes about 32 seconds as well.
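
To help separate S3 latency from schema handling, the same comparison could be run against local files. The following is only a minimal sketch, not part of the timings above: the temporary directory, the file count of 200, and the column names are made up for illustration, and unify_schemas is passed explicitly in both calls so the comparison does not depend on what the default is.

```r
library(arrow)

# Hypothetical local dataset: 200 small parquet files sharing one schema.
dir <- tempfile("parquet_many_")
dir.create(dir)
for (i in seq_len(200)) {
  write_parquet(
    data.frame(id = i, value = rnorm(10)),
    file.path(dir, sprintf("part-%03d.parquet", i))
  )
}

# Time both settings on the same set of files.
t_unified <- system.time(
  ds_unified <- open_dataset(dir, unify_schemas = TRUE)
)
t_single <- system.time(
  ds_single <- open_dataset(dir, unify_schemas = FALSE)
)

print(t_unified[["elapsed"]])
print(t_single[["elapsed"]])
```

If the two elapsed times differ locally but not against the bucket, that would suggest the time on S3 is spent somewhere other than schema inspection, for example in listing the files.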



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
