Tom-Newton opened a new issue, #36441:
URL: https://github.com/apache/arrow/issues/36441

   ### Describe the enhancement requested
   
   When doing Parquet reads from high-latency storage, e.g. S3 or Azure Blob Storage, it's beneficial to use `fragment_scan_options=ParquetFragmentScanOptions(pre_buffer=True)`.
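
   For context, a minimal sketch of how this option is typically passed through the dataset API; the file path here is just a placeholder:

   ```python
   import pyarrow.dataset as ds

   # Enable pre-buffering for Parquet fragments; helpful on high-latency stores.
   scan_options = ds.ParquetFragmentScanOptions(pre_buffer=True)

   dataset = ds.dataset(
       "bucket/path/to/data.parquet",  # placeholder path
       format=ds.ParquetFileFormat(default_fragment_scan_options=scan_options),
   )
   table = dataset.to_table()
   ```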
   
   As far as I can tell `pre_buffer=True` does 2 important things:
   1. It reads all the Parquet metadata up front and then issues all the data block reads.
   2. It coalesces adjacent data blocks so that the data can be read in a smaller number of larger reads. Coalescing is controlled by `CacheOptions`, in particular `hole_size_limit` and `range_size_limit` (a simplified sketch of the rule is shown after this list).
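
   My understanding of the coalescing rule is roughly: merge two byte ranges when the gap between them is at most `hole_size_limit` and the merged range does not exceed `range_size_limit`. A simplified pure-Python sketch of that behaviour (the limits and ranges below are illustrative numbers, not Arrow's actual defaults):

   ```python
   def coalesce(ranges, hole_size_limit, range_size_limit):
       """Greedy sketch of read coalescing; `ranges` is a list of
       (offset, length) tuples assumed sorted and non-overlapping."""
       merged = []
       for offset, length in ranges:
           if merged:
               prev_offset, prev_length = merged[-1]
               gap = offset - (prev_offset + prev_length)
               new_length = offset + length - prev_offset
               # Merge if the hole is small enough and the combined read
               # stays under the range size limit.
               if gap <= hole_size_limit and new_length <= range_size_limit:
                   merged[-1] = (prev_offset, new_length)
                   continue
           merged.append((offset, length))
       return merged

   # Illustrative numbers only: two 1 MiB column chunks separated by a 4 KiB
   # hole get merged into one ~2 MiB read; a third chunk far away stays separate.
   print(coalesce([(0, 1 << 20), ((1 << 20) + 4096, 1 << 20), (100 << 20, 1 << 20)],
                  hole_size_limit=8192, range_size_limit=32 << 20))
   ```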
   
   In my use-case I have found that while point 1 is very beneficial for performance, point 2 is actually detrimental. Point 1 alone is still enough to make `pre_buffer=True` a net advantage. I found that setting `range_size_limit` to a much smaller value (in combination with `pyarrow.set_io_thread_count(<large number>)`) gives me a ~6x performance improvement on my use-case because it increases the parallelism. I would therefore like to make these parameters configurable from Python.
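
   To make the proposal concrete, this is roughly the shape of API I have in mind. The `cache_options` parameter and `pyarrow.CacheOptions` are hypothetical names for the proposed bindings, not existing PyArrow API, and the numbers are just values of the kind that worked well for my workload:

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # More IO threads so the smaller, less-coalesced reads can run in parallel.
   pa.set_io_thread_count(64)

   # Hypothetical: expose the C++ CacheOptions knobs on the scan options.
   scan_options = ds.ParquetFragmentScanOptions(
       pre_buffer=True,
       cache_options=pa.CacheOptions(         # proposed, does not exist yet
           hole_size_limit=4 * 1024,          # illustrative value
           range_size_limit=1 * 1024 * 1024,  # much smaller than the default
       ),
   )
   table = ds.dataset(
       "bucket/path/to/data.parquet",         # placeholder path
       format=ds.ParquetFileFormat(default_fragment_scan_options=scan_options),
   ).to_table()
   ```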
   
   If you are interested in my use-case: I'm loading a single Parquet file from US East Azure Blob Storage to a workstation in the UK. The current defaults would probably be sensible if it weren't for the trans-Atlantic latency. (PyArrow does not natively support Azure yet, but I'm making use of https://github.com/apache/arrow/pull/12914 even though it's not in an official release yet.)
   
   I can make a PR that implements this if others agree that it's a sensible idea.
   
   ### Component(s)
   
   Python

