Re: [I] Parallel NDSON file reading [arrow-datafusion]

via GitHub Sat, 23 Dec 2023 04:10:01 -0800


marvinlanhenke commented on issue #8502:
URL: 
https://github.com/apache/arrow-datafusion/issues/8502#issuecomment-1868281177


   > This should be simpler than CSV, as NDJSON does not typically permit 
unescaped newline characters, so it should just be a case of finding the next 
newline
   
   @tustvold @alamb  
   Out of curiosity, I was digging into this as well. From my understanding 
(looking at the CSV impl), the `FileGroupPartitioner` and its method 
`repartition_file_groups` are used to create the partitions. There, however, 
the files are divided evenly by size. 
   
   For NDJSON to be split "correctly" (and not in the middle of a JSON 
object), would the `FileGroupPartitioner` need a new method that splits on 
newlines? Would this be a reasonable approach? 
   If so, only `fn repartitioned` of trait `ExecutionPlan` and `fn open` of 
trait `FileOpener` would need to be implemented.
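   
   To make the idea concrete, here is a minimal sketch of what "split on 
newline" could look like. It is not DataFusion code: `split_on_newlines` is a 
hypothetical helper that works on an in-memory byte slice, and it assumes 
records never contain unescaped newlines. It starts from evenly sized ranges 
and pushes each boundary forward to just past the next `\n`.

```rust
/// Hypothetical helper (not part of DataFusion): split an NDJSON buffer into
/// at most `partitions` byte ranges `(start, end)`, where every range ends
/// just after a newline (or at the end of the buffer).
fn split_on_newlines(data: &[u8], partitions: usize) -> Vec<(usize, usize)> {
    let len = data.len();
    if len == 0 || partitions == 0 {
        return Vec::new();
    }

    // Target size of each range when dividing evenly by size (rounded up).
    let chunk = (len + partitions - 1) / partitions;

    let mut ranges = Vec::new();
    let mut start = 0;
    while start < len {
        // Boundary we would get from a purely size-based split.
        let tentative = (start + chunk).min(len);

        // Move the boundary forward to just past the next newline, so that
        // no JSON object is cut in half. If the byte right before the
        // tentative boundary is already a newline, the boundary stays put.
        let end = if tentative >= len {
            len
        } else {
            match data[tentative - 1..].iter().position(|&b| b == b'\n') {
                Some(offset) => tentative + offset,
                None => len, // no further newline: last record runs to EOF
            }
        };

        ranges.push((start, end));
        start = end;
    }
    ranges
}
```

   With three 8-byte records and two requested partitions, the size-based 
boundary at byte 12 would land inside the second record, so the sketch moves 
it to byte 16. A real implementation would have to do the same adjustment 
against an object store rather than a slice, which this sketch does not cover.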
   
   Thanks for helping out.

