felipecrv commented on issue #40035: URL: https://github.com/apache/arrow/issues/40035#issuecomment-1942073463
> These are all good suggestions but they are a lot more complex. Personally I would not be comfortable committing to implement something like that.

Start exposing `read_file_max_concurrency` with a generous value and hardcode values for `initial_chunk_size` and `chunk_size` inside `azurefs.cc` to your liking, then open a separate issue that I can work on.

My goal is to avoid prematurely exposing hard-to-tweak settings that (1) are difficult to tweak in a well-informed way, and (2) would prevent future optimizations on our side based on real-world workloads.

The default settings of the SDK seem to be very conservative regarding parallelization, because most SDK users are likely making full-blob downloads inside a system that already manages multiple I/O threads -- that explains their huge threshold for parallelization (`>256MB`). We know Arrow workloads are probably small-to-medium `ReadAt` calls to the same file on a single thread (e.g. a Jupyter notebook driving the calls), so we benefit from the parallelism offered by the SDK: a lower `initial_chunk_size` and `chunk_size` combined with a high `concurrency`.
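To make the chunking scheme concrete, here is a rough, self-contained sketch of how a single `ReadAt(offset, length)` could be split into HTTP range requests the way the SDK's chunked download works: one initial chunk of `initial_chunk_size`, then fixed-size chunks of `chunk_size` that would be fetched with up to `concurrency` parallel requests. This is not Arrow's or the Azure SDK's actual code; `MakeChunkPlan` and `Range` are hypothetical names, and the parameter names simply mirror the settings discussed above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// A byte range of the blob to fetch in one HTTP request.
struct Range {
  int64_t offset;
  int64_t length;
};

// Split ReadAt(offset, length) into ranges: the first request covers at most
// initial_chunk_size bytes, the remainder is covered by chunk_size-sized
// requests. A downloader would then issue up to `concurrency` of these
// ranges in parallel.
std::vector<Range> MakeChunkPlan(int64_t offset, int64_t length,
                                 int64_t initial_chunk_size,
                                 int64_t chunk_size) {
  std::vector<Range> plan;
  const int64_t end = offset + length;
  const int64_t first = std::min(length, initial_chunk_size);
  if (first > 0) {
    plan.push_back({offset, first});
  }
  for (int64_t pos = offset + first; pos < end;) {
    const int64_t n = std::min(chunk_size, end - pos);
    plan.push_back({pos, n});
    pos += n;
  }
  return plan;
}
```

With a small `initial_chunk_size` (say 1 MiB) and `chunk_size` (say 4 MiB), even a medium-sized `ReadAt` produces several ranges, so a high `concurrency` actually gets used; with the SDK's conservative defaults, reads below the threshold collapse to a single sequential request.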
