my-vegetable-has-exploded commented on issue #8503: URL: https://github.com/apache/arrow-datafusion/issues/8503#issuecomment-1868475220
I read the related PRs for Parquet and CSV. Parquet's parallel scan is based on row groups and CSV's is based on lines: both can be split by row and then emit RecordBatches. I don't think Arrow can be handled like that, since an Arrow file is purely column-based. But I am wondering whether we could split the scan into several parts and then rebuild the whole batch, since there may be more than one array in the file (see the sketch below).

Merry Christmas!
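For illustration only, here is a minimal sketch of what batch-level splitting could look like with arrow-rs. It is not DataFusion's implementation; `read_partition` and its parameters are hypothetical names, and it assumes the IPC file footer lets us seek to an arbitrary batch via `FileReader::set_index`.

```rust
use std::fs::File;

use arrow::error::Result;
use arrow::ipc::reader::FileReader;
use arrow::record_batch::RecordBatch;

/// Hypothetical helper: read only the record batches assigned to one
/// scan partition of an Arrow IPC file.
fn read_partition(
    path: &str,
    partition: usize,
    num_partitions: usize,
) -> Result<Vec<RecordBatch>> {
    let file = File::open(path)?;
    let mut reader = FileReader::try_new(file, None)?;

    // The IPC footer indexes every batch, so we can partition by batch index.
    let total = reader.num_batches();
    let chunk = (total + num_partitions - 1) / num_partitions;
    let start = partition * chunk;
    if start >= total {
        return Ok(Vec::new()); // this partition gets no batches
    }
    let end = (start + chunk).min(total);

    // Seek directly to this partition's first batch.
    reader.set_index(start)?;

    let mut batches = Vec::with_capacity(end - start);
    for _ in start..end {
        match reader.next() {
            Some(batch) => batches.push(batch?),
            None => break,
        }
    }
    Ok(batches)
}
```

Each partition would then emit its batches independently. Note that, unlike CSV, no batch is split by row here, so the achievable parallelism would be capped by the number of batches in the file.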
