felipecrv commented on issue #40035:
URL: https://github.com/apache/arrow/issues/40035#issuecomment-1942073463

   > These are all good suggestions but they are a lot more complex. Personally 
I would not be comfortable committing to implement something like that.
   
   Start by exposing `read_file_max_concurrency` with a generous default value, hardcode `initial_chunk_size` and `chunk_size` inside `azurefs.cc` to your liking, then open a separate issue that I can work on.
   
   My goal is to avoid prematurely exposing settings that (1) are difficult to tweak in a well-informed way, and (2) would prevent future optimizations on our side based on real-world workloads.
   
   The default settings of the SDK seem very conservative regarding parallelization, probably because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads -- that explains their huge threshold for parallelization (`>256MB`). Arrow workloads, by contrast, are probably small-to-medium `ReadAt` calls to the same file on a single thread (e.g. a Jupyter notebook driving the calls), so we benefit from the parallelism offered by the SDK: a lower `initial_chunk_size` and `chunk_size` combined with a high `concurrency`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
