eeroel opened a new issue, #38591:
URL: https://github.com/apache/arrow/issues/38591

   ### Describe the enhancement requested
   
   This recent commit introduced `OpenAsync` for `FileSource`, which enabled 
higher I/O parallelism for Parquet reading: 
https://github.com/apache/arrow/commit/02de3c1789460304e958936b78d60f824921c250
   
   Here's an example reading a FileSystemDataset from Python, using 
`fragment_readahead = 100` and I/O concurrency set to 100. The Y-axis 
represents files, the X-axis is time in seconds, and each point is the 
relative start time of a request (HEAD or GET):
   <img width="955" alt="Screenshot 2023-11-05 at 18 43 54" 
src="https://github.com/apache/arrow/assets/10564706/968deab9-7a7b-4ab5-89c1-cb1c7e9c85ee">
   
   With the current `main` 
https://github.com/apache/arrow/commit/fc8c6b7dc8287c672b62c62f3a2bd724b3835063 
it seems that the first request for each file is again made from the same 
thread. As a result, the file reads effectively start in sequence, and 
settings such as `fragment_readahead` have limited effect:
   
   <img width="955" alt="Screenshot 2023-11-05 at 18 43 54" 
src="https://github.com/apache/arrow/assets/10564706/274bc29d-63d0-4ad8-a1c8-a45277a9ebd5">
   
   
   ### Component(s)
   
   Parquet


