Carl Boettiger created ARROW-18114:
--------------------------------------
Summary: [R] unify_schemas=FALSE does not improve open_dataset() read times
Key: ARROW-18114
URL: https://issues.apache.org/jira/browse/ARROW-18114
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Carl Boettiger
open_dataset() provides the very helpful optional argument unify_schemas.
Setting unify_schemas=FALSE should allow arrow to inspect a single parquet file
instead of touching potentially thousands or more parquet files to determine a
consistent unified schema. This ought to provide a substantial performance
improvement in contexts where the schema is known in advance.
Unfortunately, in my tests it seems to have no impact on performance. Consider
the following reprexes:
Default, unify_schemas=TRUE:
    library(arrow)
    ex <- s3_bucket("neon4cast-scores/parquet/terrestrial_30min",
                    endpoint_override = "data.ecoforecast.org", anonymous = TRUE)
    bench::bench_time({
      open_dataset(ex)
    })
This takes about 32 seconds for me.
Manual, unify_schemas=FALSE:
    bench::bench_time({
      open_dataset(ex, unify_schemas = FALSE)
    })
This takes about 32 seconds as well.
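To separate schema inspection from S3 listing and network latency, the same comparison could be run against a local dataset. The following is only a minimal sketch under stated assumptions: the temporary directory, the column names (part, value), and the file count of 50 are all made up for illustration and are not part of the bucket above. It writes one small parquet file per partition, times both settings, and checks that the resulting schemas agree.

```r
library(arrow)

# Hypothetical local dataset: 50 small parquet files with identical schemas,
# one file per value of the (made-up) partition column `part`.
dir <- tempfile("unify_schemas_demo")
df <- data.frame(part = rep(seq_len(50), each = 10), value = rnorm(500))
write_dataset(df, dir, format = "parquet", partitioning = "part")

# Time dataset discovery with and without schema unification.
t_unify <- system.time(ds_unify <- open_dataset(dir, unify_schemas = TRUE))
t_first <- system.time(ds_first <- open_dataset(dir, unify_schemas = FALSE))

print(t_unify[["elapsed"]])
print(t_first[["elapsed"]])

# When every file has the same schema, both settings should agree.
stopifnot(ds_unify$schema$Equals(ds_first$schema))
```

With only 50 small local files, any difference may be too small to measure reliably. The point of the sketch is the shape of the comparison, not the timings.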