alamb commented on issue #2205:
URL: https://github.com/apache/arrow-datafusion/issues/2205#issuecomment-1097902278

   I see two major and somewhat orthogonal usecases:
   
   *Usecase*: Multiple reads of unpredictable column / row group subsets of the 
same file (e.g. IOx)
   *Optimal*: Read data to a local file
   
   *Usecase*: Single read of a subset of columns / row groups (e.g. Cloud Fuse, 
other "analytics on S3 parquet files")
   *Optimal*: Read only the subset of the data that is needed into memory, 
discard after decode 
   
   I have been hoping our ObjectStore interface would allow for both usecases. 
   
   In terms of the "many small requests to S3" problem, I was imagining that 
the S3 ObjectStore implementation would implement "pre-fetching" internally 
(the same way local filesystems do) to coalesce multiple small requests into 
fewer, larger ones. This strategy is particularly effective when we know which 
parts of the file are likely to be needed.
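
   The coalescing idea can be sketched as a small standalone function: merge 
sorted byte ranges whose gaps are below some threshold into fewer, larger 
fetches. This is an illustrative sketch, not DataFusion's or the object_store 
crate's actual API; the `max_gap` parameter and range representation are 
assumptions.

   ```rust
   use std::ops::Range;

   /// Merge byte ranges separated by at most `max_gap` bytes into single
   /// larger ranges, trading a little over-read for far fewer requests.
   /// (Hypothetical helper for illustration.)
   fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
       ranges.sort_by_key(|r| r.start);
       let mut out: Vec<Range<u64>> = Vec::new();
       for r in ranges {
           match out.last_mut() {
               // Gap to the previous range is small enough: extend it
               Some(prev) if r.start <= prev.end + max_gap => {
                   prev.end = prev.end.max(r.end);
               }
               _ => out.push(r),
           }
       }
       out
   }

   fn main() {
       // Three small column-chunk reads; the first two are close together
       let ranges = vec![0..100, 120..200, 10_000..10_500];
       let merged = coalesce_ranges(ranges, 1024);
       assert_eq!(merged, vec![0..200, 10_000..10_500]);
       println!("{merged:?}"); // prints [0..200, 10000..10500]
   }
   ```

   With a gap threshold around the size of one S3 round trip's worth of 
bytes, the two nearby reads become a single request while the distant one 
stays separate.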
   
   Conveniently, the parquet format is quite amenable to this: once the 
reader has decided it wants to scan a row group, it also knows which file 
data (offsets) it needs. 
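
   Concretely, the footer metadata records where each column chunk lives, so 
the exact byte ranges for a projected scan fall out directly. The structs 
below are a simplified stand-in for parquet footer metadata, not the parquet 
crate's real types; `ranges_for_scan` is a hypothetical helper.

   ```rust
   use std::ops::Range;

   /// Simplified stand-ins for parquet footer metadata (illustrative only).
   struct ColumnChunk {
       offset: u64, // byte offset of the chunk within the file
       length: u64, // total compressed size of the chunk
   }
   struct RowGroup {
       columns: Vec<ColumnChunk>,
   }

   /// For the selected row groups and a column projection, compute the
   /// exact file byte ranges the scan needs to fetch.
   fn ranges_for_scan(row_groups: &[RowGroup], projected: &[usize]) -> Vec<Range<u64>> {
       row_groups
           .iter()
           .flat_map(|rg| {
               projected.iter().map(|&i| {
                   let c = &rg.columns[i];
                   c.offset..c.offset + c.length
               })
           })
           .collect()
   }

   fn main() {
       let rgs = vec![RowGroup {
           columns: vec![
               ColumnChunk { offset: 4, length: 100 },
               ColumnChunk { offset: 104, length: 50 },
           ],
       }];
       // Project only the second column: one precise range, nothing extra
       assert_eq!(ranges_for_scan(&rgs, &[1]), vec![104..154]);
   }
   ```

   Feeding ranges like these into a coalescing pre-fetcher is what would let 
the ObjectStore turn many small column-chunk reads into a few large GETs.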


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
