thinkharderdev commented on issue #2489:
URL: 
https://github.com/apache/arrow-datafusion/issues/2489#issuecomment-1127484356

   > Yeah as alluded to by @alamb, my plan is to get the iox code released to 
crates.io so that DataFusion _could_ use it.
   > 
   > There would then be a couple of potential courses of action for DataFusion:
   > 
   > * Do nothing 😄
   > * Migrate to using the `object_store` crate to fetch parquet files to 
local disk. This would potentially fetch more bytes from object storage, but as 
described in [RFC: Spill-To-Disk Object Storage Download 
#2205](https://github.com/apache/arrow-datafusion/issues/2205) this may 
actually be faster than the current approach. It would also be temporary 
pending [Push-Based Parquet Reader 
arrow-rs#1605](https://github.com/apache/arrow-rs/issues/1605)
   > * Wait for [Push-Based Parquet Reader 
arrow-rs#1605](https://github.com/apache/arrow-rs/issues/1605) and then migrate 
to using the `object_store` crate
   
   With regard to fetching to local disk, we have an implementation of the 
(DataFusion) `ObjectStore` in our project which adopts the S3A approach to 
minimize the number of small range requests. Basically, we set a minimum chunk 
size for S3 reads (usually 64 KiB). If a read of less than 64 KiB is requested, 
we fetch the full 64 KiB anyway and buffer it in memory. Subsequent reads that 
fall within that buffer are served from memory instead of going back to S3. 
This minimizes the overhead of small range requests from the `PageIterator` 
while still avoiding reads of columns not required for the query. 
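
To make the buffering idea concrete, here is a minimal sketch in plain Rust. It is not the implementation from our project and it does not use the DataFusion or `object_store` APIs: `RangeSource`, `CountingSource` and `BufferedReader` are hypothetical names made up for illustration, and the "remote store" is just an in-memory byte vector that counts how many range requests it receives. It also assumes that reads at or above the minimum chunk size bypass the buffer, which is one possible policy.

```rust
use std::cell::Cell;
use std::ops::Range;

/// Hypothetical stand-in for a remote store: anything that can serve a byte range.
pub trait RangeSource {
    fn len(&self) -> usize;
    fn fetch(&self, range: Range<usize>) -> Vec<u8>;
}

/// In-memory source that counts range requests so the effect of buffering is visible.
pub struct CountingSource {
    data: Vec<u8>,
    requests: Cell<usize>,
}

impl CountingSource {
    pub fn new(data: Vec<u8>) -> Self {
        Self { data, requests: Cell::new(0) }
    }

    pub fn requests(&self) -> usize {
        self.requests.get()
    }
}

impl RangeSource for CountingSource {
    fn len(&self) -> usize {
        self.data.len()
    }

    fn fetch(&self, range: Range<usize>) -> Vec<u8> {
        self.requests.set(self.requests.get() + 1);
        self.data[range].to_vec()
    }
}

/// Reader that rounds small reads up to `min_chunk` bytes and keeps the last chunk in memory.
pub struct BufferedReader<S: RangeSource> {
    source: S,
    min_chunk: usize,
    buf_start: usize,
    buf: Vec<u8>,
}

impl<S: RangeSource> BufferedReader<S> {
    pub fn new(source: S, min_chunk: usize) -> Self {
        Self { source, min_chunk, buf_start: 0, buf: Vec::new() }
    }

    pub fn source(&self) -> &S {
        &self.source
    }

    /// Read `len` bytes starting at `offset`, clamped to the end of the object.
    pub fn read(&mut self, offset: usize, len: usize) -> Vec<u8> {
        let end = offset.saturating_add(len).min(self.source.len());
        let start = offset.min(end);

        // Case 1: the request lies entirely inside the buffered chunk.
        let buf_end = self.buf_start + self.buf.len();
        if start >= self.buf_start && end <= buf_end {
            return self.buf[start - self.buf_start..end - self.buf_start].to_vec();
        }

        // Case 2: the request is already large, so fetch exactly what was asked for.
        if end - start >= self.min_chunk {
            return self.source.fetch(start..end);
        }

        // Case 3: small read. Fetch a full chunk, keep it, and return the prefix.
        let fetch_end = (start + self.min_chunk).min(self.source.len());
        self.buf = self.source.fetch(start..fetch_end);
        self.buf_start = start;
        self.buf[..end - start].to_vec()
    }
}
```

With `min_chunk` set to `64 * 1024`, a 10-byte read at offset 100 triggers one request for 64 KiB, and a later 500-byte read at offset 4000 is answered from the buffer without touching the source again. A real implementation would also need async I/O, error handling, and probably more than one cached chunk, all of which are left out here.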

