Re: [I] Parquet deserialization speeds slower on Linux [arrow]

via GitHub Wed, 25 Oct 2023 06:32:13 -0700


lidavidm commented on issue #38389:
URL: https://github.com/apache/arrow/issues/38389#issuecomment-1779285908


   Re: the pipelining/IO discussion, you may find this thread interesting: 
https://lists.apache.org/thread/cdfkm8oflm2zvd25yn4k6gh2o7pc9z88
   
   Some (but not all) of those proposals were implemented in Arrow, primarily 
"pre-buffering", though pre-buffering is probably not the ideal way to implement 
this because it uses too much memory. One thing that didn't make it was the 
global concurrency manager, which would have approximated priority by not 
issuing reads for a file until all reads for previous files have been issued. 
Of course, this only makes sense if there's an ordering between the files in 
the first place, which is not necessarily true for a dataset.
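
To make the global concurrency manager idea concrete, here is a rough sketch of the scheduling rule only, not anything that exists in Arrow. The class `GlobalReadScheduler`, its methods, and the `max_in_flight` cap are all hypothetical names made up for illustration, and no real I/O happens: it just records which read would be issued next, always draining the earliest registered file before touching later ones.

```python
from collections import OrderedDict, deque


class GlobalReadScheduler:
    """Hypothetical sketch: cap in-flight reads and drain files in order."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        # File name -> queue of (offset, length); insertion order is priority.
        self.pending = OrderedDict()
        # Log of (file name, offset, length) in the order reads were issued.
        self.issued = []

    def register(self, name, ranges):
        """Queue the byte ranges a file wants to read; nothing is issued yet."""
        ranges = list(ranges)
        if ranges:
            self.pending.setdefault(name, deque()).extend(ranges)

    def issue_ready(self):
        """Issue reads while slots are free, always from the earliest file."""
        started = []
        while self.in_flight < self.max_in_flight and self.pending:
            name, queue = next(iter(self.pending.items()))
            offset, length = queue.popleft()
            if not queue:
                # All reads for this file are issued; the next file may start.
                del self.pending[name]
            self.in_flight += 1
            started.append((name, offset, length))
        self.issued.extend(started)
        return started

    def complete(self):
        """Mark one read as finished, then try to fill the freed slot."""
        self.in_flight -= 1
        return self.issue_ready()


scheduler = GlobalReadScheduler(max_in_flight=2)
scheduler.register("a.parquet", [(0, 100), (100, 100), (200, 100)])
scheduler.register("b.parquet", [(0, 50)])

first = scheduler.issue_ready()  # fills both slots from a.parquet
second = scheduler.complete()    # freed slot goes to a.parquet's last read
third = scheduler.complete()     # only now is b.parquet's read issued
```

Reads for `b.parquet` stay queued until every read for `a.parquet` has been issued, which is the "approximate priority" described above. A real implementation would also need actual async I/O and some policy for datasets with no natural file order.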
   
   That said, I believe datasets already parallelizes at the row-group level, 
@mapleFU.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
