alamb commented on issue #8503:
URL: 
https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1868508443

   > But I am wondering whether we can split the scan process into several 
parts and rebuild the whole Batch, since there may be more than one array in 
the file.
   
   This sounds like a good idea to me in theory -- I am not sure how easy/hard 
it would be to do with the existing arrow IPC reader.
   
   In general, the strategy for parallelizing Parquet and CSV is to split the 
file into byte ranges and then have each of the `ArrowFileReader` partitions 
read the row groups (or CSV lines) whose first byte falls within its assigned 
range.
   
   Perhaps we could do the same for Arrow files, using the first byte of each 
RecordBatch 🤔 
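   To make the range-based strategy concrete, here is a minimal sketch (not 
DataFusion's actual API -- `partition_ranges` and `units_for_range` are 
hypothetical helpers) showing how a file can be split into byte ranges and how 
each unit (a row group, CSV line, or RecordBatch) is claimed by exactly one 
partition, namely the one whose range contains the unit's first byte:

   ```rust
   /// Split a file of `file_len` bytes into `n` contiguous half-open byte
   /// ranges `[start, end)` that together cover the whole file.
   fn partition_ranges(file_len: u64, n: u64) -> Vec<(u64, u64)> {
       // Round the chunk size up so the ranges cover every byte.
       let chunk = file_len / n + if file_len % n > 0 { 1 } else { 0 };
       (0..n)
           .map(|i| (i * chunk, ((i + 1) * chunk).min(file_len)))
           .collect()
   }

   /// Return the indices of the units (given by their starting byte offsets)
   /// that the partition covering `[start, end)` should read. Because each
   /// unit's first byte lies in exactly one range, no unit is read twice.
   fn units_for_range(unit_offsets: &[u64], start: u64, end: u64) -> Vec<usize> {
       unit_offsets
           .iter()
           .enumerate()
           .filter(|(_, &off)| off >= start && off < end)
           .map(|(i, _)| i)
           .collect()
   }

   fn main() {
       // A 1000-byte file containing four units starting at these offsets.
       let offsets = [0u64, 300, 550, 900];
       let ranges = partition_ranges(1000, 2); // [(0, 500), (500, 1000)]
       let first = units_for_range(&offsets, ranges[0].0, ranges[0].1);
       let second = units_for_range(&offsets, ranges[1].0, ranges[1].1);
       // Units 0 and 1 start inside [0, 500); units 2 and 3 inside [500, 1000).
       println!("partition 0 reads {:?}, partition 1 reads {:?}", first, second);
   }
   ```

   Note that the second unit (offset 300) may well extend past byte 500; the 
convention is that a partition finishes any unit it starts, which is why only 
the *first* byte decides ownership.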
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
