cboettig opened a new issue, #35715: URL: https://github.com/apache/arrow/issues/35715
### Describe the bug, including details regarding any error messages, version, and platform

Using `open_dataset()` on a remote S3 root with lots of partition files can be quite slow simply because listing files on S3 is really slow (https://github.com/apache/arrow/issues/34145). Some of this might be improved by https://github.com/apache/arrow/issues/34213, but this slow listing is apparently a well-known limitation of the S3 API, and Amazon provides the [S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html) system precisely because of it. Inventory lets us determine the URI of each partition file ahead of time, which should in principle be much faster.

However, passing a large vector of URIs turns out to be even slower and less RAM-efficient! I'm not sure what's going on, but it feels to me like `unify_schemas = FALSE` is being ignored, despite the docs saying it is the default for a vector of URIs. Here's what should be a reproducible illustration of the issue:

```r
library(arrow)

s3 <- s3_bucket("neon4cast-scores/parquet/aquatics",
                endpoint_override = "data.ecoforecast.org",
                anonymous = TRUE)

bench::bench_time(          # very slow
  ds <- open_dataset(s3)
)

# Can we work around this with a pre-computed vector of URIs?
bench::bench_time(          # very slow, but available via S3 Inventory
  all_paths <- s3$ls(recursive = TRUE)
)
all_paths <- all_paths[grepl("[.]parquet", all_paths)]
uris <- paste0("s3://neon4cast-scores/parquet/aquatics/", all_paths,
               "?endpoint_override=data.ecoforecast.org")

# Should be fast now that we know the URIs ahead of time and avoid the
# ls() overhead. But wow, this is worse!
bench::bench_time(          # incredibly slow & accumulates much higher RAM use
  open_dataset(uris)
)
```

(Also, note that duckdb can open this vector of URIs considerably more quickly, if with rather more verbose code; a sketch follows at the end of this report.)

Not sure what I am missing here. Am I wrong in thinking that things should be faster with the pre-computed vector of URIs than when arrow effectively has to do the recursive `ls` itself? Any idea what makes opening the vector of URIs so slow here? Is there a better alternative strategy in this setting (other than "use fewer partitions"; I know that would help, but I can't do so here)? (@westonpace or others probably have some good insight here!)

### Component(s)

R
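One hedged workaround sketch bearing on the `unify_schemas` hypothesis above: `open_dataset()` documents both a `schema` and a `unify_schemas` argument, so supplying an explicit schema should remove any reason for arrow to inspect every file. Whether this actually avoids the slowdown is untested here; it assumes all partition files share the schema of the first one.

```r
# Untested sketch: pass an explicit schema so arrow need not inspect each file.
# Assumes every partition file shares the schema of the first one.
one_schema <- open_dataset(uris[1])$schema   # schema of a single partition file

bench::bench_time(
  ds2 <- open_dataset(uris, schema = one_schema, unify_schemas = FALSE)
)
```

If this is fast, that would support the suspicion that schema unification (or per-file inspection) is where the time and memory go.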
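And for reference, a minimal sketch of the duckdb comparison mentioned in the report. Assumptions not taken from the issue: the `httpfs` extension is available, path-style URLs work against this endpoint, and anonymous access needs no credentials. duckdb takes plain `s3://` URIs, so the `?endpoint_override=` query string used for arrow is dropped in favor of duckdb's `s3_endpoint` setting.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "SET s3_endpoint='data.ecoforecast.org'")
dbExecute(con, "SET s3_url_style='path'")   # non-AWS endpoints typically need path-style URLs

# Plain s3:// URIs; the endpoint is configured via the settings above
duckdb_uris <- paste0("s3://neon4cast-scores/parquet/aquatics/", all_paths)
uri_array   <- paste0("['", paste(duckdb_uris, collapse = "', '"), "']")

bench::bench_time(
  dbExecute(con, paste0(
    "CREATE VIEW scores AS SELECT * FROM read_parquet(", uri_array, ")"
  ))
)
```

This is the "rather more verbose code" trade-off: the URI list has to be spliced into SQL rather than passed as an R vector.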
