Re: [I] Parquet deserialization speeds slower on Linux [arrow]

via GitHub Wed, 25 Oct 2023 06:22:27 -0700


mapleFU commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1779265992


   @mrocklin 
   
   In the best case, IO and CPU work are pipelined: we wait for the first 
group of IO requests to finish, then start the handling logic while the 
remaining IO is still in flight.
   
   However, a few points might affect this:
   
   1. Fetching in a thread pool doesn't guarantee priority, so some parts might 
complete in a different order than expected. For example, suppose there are 2 
column chunks, each with 5 IO requests. The 5th IO might finish earlier than the 
first one.
   2. As a result, the pattern might become: wait for IO -> do CPU work, with 
little overlap between the two.
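
To illustrate point 1, here is a minimal sketch using only the Python standard library. It is not Arrow code: `fetch` and the `LATENCIES` values are hypothetical stand-ins for IO requests, chosen so that the first request is the slowest. It shows that a thread pool returns results in completion order, not submission order, so a consumer that needs request 0 first ends up waiting on it while later requests are already done.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-request latencies in seconds; request 0 is the slowest.
LATENCIES = [0.20, 0.05, 0.10, 0.15, 0.01]


def fetch(index):
    """Stand-in for one IO request: sleep to simulate latency, return the index."""
    time.sleep(LATENCIES[index])
    return index


def completion_order():
    """Submit all requests in index order and record the order they finish in."""
    with ThreadPoolExecutor(max_workers=len(LATENCIES)) as pool:
        futures = [pool.submit(fetch, i) for i in range(len(LATENCIES))]
        # as_completed yields futures as they finish, regardless of submit order.
        return [future.result() for future in as_completed(futures)]


if __name__ == "__main__":
    print("submitted:", list(range(len(LATENCIES))))
    print("completed:", completion_order())
```

A reader that must decode request 0 before anything else cannot start CPU work until the slowest request returns, which is the "wait for IO -> do CPU work" pattern described above.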
   
   To optimize this, the dataset API can split a file into separate row 
groups, each with its own row-group reader (aka `ParquetFragment`..). The 
fragments are then fetched and decoded in parallel. This might help a bit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
