cboettig opened a new issue, #35715:
URL: https://github.com/apache/arrow/issues/35715

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Using `open_dataset()` on a remote S3 root with lots of partition files can be quite slow, largely because listing files on S3 is really slow (https://github.com/apache/arrow/issues/34145).  Some of this might be improved by https://github.com/apache/arrow/issues/34213, but this slow listing is apparently a well-known limitation of the S3 API, and Amazon provides the [S3 Inventory](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html) system precisely because of it.  An inventory lets us determine the URI of every partition file ahead of time, which should in principle be much faster than listing.
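   For context, here's roughly how I'd build the URI vector from an inventory instead of listing the bucket.  This is just a sketch: the inventory bucket and manifest path below are hypothetical, and I'm assuming a CSV-formatted inventory whose first two columns are bucket and key (S3 Inventory CSV files have no header row).
   
   ```r
   library(arrow)
   
   # Hypothetical destination of a CSV-formatted S3 Inventory report;
   # real inventories land in whatever destination bucket you configure.
   inv <- read_csv_arrow(
     "s3://my-inventory-bucket/neon4cast-scores/inventory-00000.csv.gz",
     col_names = c("bucket", "key")
   )
   uris <- paste0("s3://", inv$bucket, "/", inv$key)
   uris <- uris[grepl("[.]parquet$", uris)]
   ```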
   
   However, passing a large vector of URIs turns out to be even slower and less RAM-efficient!  I'm not sure what's going on, but it feels to me like maybe `unify_schemas = FALSE` is being ignored, despite the docs saying it is the default for a vector of URIs.
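   For what it's worth, this is the kind of call I'd expect to skip per-file schema inspection entirely (an untested sketch, reading the schema from a single file and passing it explicitly):
   
   ```r
   # Read the schema from one file, then hand it to open_dataset() so
   # arrow shouldn't need to touch every file just to resolve the schema.
   sch <- arrow::open_dataset(uris[1])$schema
   ds <- arrow::open_dataset(uris, schema = sch, unify_schemas = FALSE)
   ```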
   
   
   Here's what should be a reproducible illustration of the issue.  
   
   ```r
   library(arrow)
   
   s3 <- s3_bucket("neon4cast-scores/parquet/aquatics",
                   endpoint_override = "data.ecoforecast.org",
                   anonymous = TRUE)
   
   
   bench::bench_time( # very slow
     ds <- open_dataset(s3)
   )
   
   
   # Can we work around this with pre-computed vector of URIs?  
   
   bench::bench_time( # very slow, but available via S3 Inventory
     all_paths <- s3$ls(recursive = TRUE)
   )
   
   all_paths <- all_paths[grepl("[.]parquet$", all_paths)]
   uris <- paste0("s3://neon4cast-scores/parquet/aquatics/", all_paths,
                  "?endpoint_override=data.ecoforecast.org")
   
   # Should be fast now that we know the URIs ahead of time and avoid
   # the ls() overhead, but wow, this is worse!
   bench::bench_time( # incredibly slow & accumulates much higher RAM use
     open_dataset(uris)
   )
   ```
   
   (Also, note that duckdb can open this same vector of URIs considerably more quickly, albeit with rather more verbose code; a sketch is below.)
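   Roughly what that duckdb comparison looks like (a sketch: the `?endpoint_override=` suffix is arrow-specific so it gets stripped first, and I'm assuming anonymous access works under duckdb's httpfs defaults):
   
   ```r
   library(DBI)
   library(duckdb)
   
   con <- dbConnect(duckdb())
   dbExecute(con, "INSTALL httpfs")
   dbExecute(con, "LOAD httpfs")
   dbExecute(con, "SET s3_endpoint='data.ecoforecast.org'")
   # non-AWS endpoints sometimes also need: SET s3_url_style='path'
   
   # duckdb takes plain s3:// URIs; drop the arrow-specific query string
   duckdb_uris <- sub("\\?endpoint_override=.*$", "", uris)
   
   # read_parquet() accepts a list of files: read_parquet(['a', 'b', ...])
   sql <- sprintf(
     "SELECT COUNT(*) FROM read_parquet([%s])",
     paste0("'", duckdb_uris, "'", collapse = ", ")
   )
   bench::bench_time(dbGetQuery(con, sql))
   ```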
   
   
   
   
   Not sure what I am missing here.  Am I wrong in thinking that things should be faster with the pre-computed vector of URIs, rather than leaving arrow to effectively do the recursive `ls` itself?  Any idea what makes opening the vector of URIs so slow here?  Is there a better strategy in this setting, other than 'use fewer partitions'?  (I know fewer partitions would help, but I can't change that here.)  (@westonpace or others probably have some good insight!)
   
   
   ### Component(s)
   
   R

