mapleFU commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1779265992
@mrocklin In the best case, IO and CPU work are pipelined: we wait for the first group of IO to finish, and then go ahead with the handling logic while the remaining IO continues. However, a few points might affect this:

1. Fetching in a thread pool doesn't ensure priority, so some parts might complete in a different order than expected. For example, say there are 2 column chunks and each chunk needs 5 IOs. The 5th IO might finish earlier than the first one.
2. As a result, the pattern might become: wait for IO -> do CPU things.

To optimize this, the dataset API might split a file into different row groups, each with its own row-group reader (aka `ParquetFragment`..). The fragments are then fetched and read in parallel. This might help a bit.
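The ordering problem in point 1 can be illustrated with a toy simulation. This is only a sketch using the Python standard library, not Arrow's actual reader code: `simulate_out_of_order_io` and `fetch` are hypothetical names, and the "IOs" are plain threads that block on events so that the completion order is forced to be the reverse of the submission order.

```python
import concurrent.futures as cf
import threading


def simulate_out_of_order_io(num_ios=5):
    """Toy model: submit IOs 0..n-1 to a pool, but let them finish in reverse.

    Returns (completion_order, ios_waited), where ios_waited is how many IOs
    had to finish before IO 0 (the one a sequential decoder needs first) was
    available.
    """
    # One gate per hypothetical IO; an IO "completes" only once its gate opens.
    gates = [threading.Event() for _ in range(num_ios)]
    completion_order = []
    lock = threading.Lock()

    def fetch(i):
        gates[i].wait()  # stand-in for a blocking read
        with lock:
            completion_order.append(i)
        return i

    with cf.ThreadPoolExecutor(max_workers=num_ios) as pool:
        # Submission order is 0, 1, 2, ... as a reader would request it.
        futures = [pool.submit(fetch, i) for i in range(num_ios)]
        # Force the worst case: the last IO finishes first, the first IO last.
        for i in reversed(range(num_ios)):
            gates[i].set()
            futures[i].result()

    # A decoder that must start with IO 0 stays idle until IO 0 has landed.
    ios_waited = completion_order.index(0) + 1
    return completion_order, ios_waited


if __name__ == "__main__":
    order, waited = simulate_out_of_order_io(5)
    print(order)   # [4, 3, 2, 1, 0]
    print(waited)  # 5
```

In this forced worst case the decoder cannot start until all 5 IOs are done, which is the "wait for IO -> do CPU things" pattern from point 2 instead of an overlapped pipeline.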
