[GitHub] [arrow-datafusion] devinjdangelo opened a new pull request, #7452: Parallelize Stateless (CSV/JSON) File Write Serialization

via GitHub Wed, 30 Aug 2023 12:40:46 -0700


devinjdangelo opened a new pull request, #7452:
URL: https://github.com/apache/arrow-datafusion/pull/7452


   ## Which issue does this PR close?
   
   Part of #7079
   
   ## Rationale for this change
   
   Serialization of "stateless" file types (where the serialized bytes of each 
record batch have no dependency on the serialized bytes of any other record 
batch) can be parallelized efficiently across all available CPU cores, 
significantly decreasing the time needed to write out the file. This PR uses 
tokio task spawning to parallelize file serialization.
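   
   As a rough illustration of why stateless formats parallelize cleanly, the 
sketch below serializes each batch independently and concatenates the results 
in input order. It is not the code in this PR: it swaps tokio tasks for 
scoped `std::thread` workers so that it runs without external crates, and 
`Batch`, `serialize_batch`, and `serialize_parallel` are hypothetical names 
standing in for Arrow record batches and the real serializers. CSV headers 
are ignored for simplicity.

```rust
use std::thread;

/// Hypothetical stand-in for an Arrow record batch: rows of string cells.
type Batch = Vec<Vec<String>>;

/// Serialize one batch to CSV bytes. The output depends only on this batch,
/// which is what makes the format "stateless".
fn serialize_batch(batch: &Batch) -> Vec<u8> {
    let mut out = Vec::new();
    for row in batch {
        out.extend_from_slice(row.join(",").as_bytes());
        out.push(b'\n');
    }
    out
}

/// Serialize every batch on its own thread, then concatenate in input order.
/// Assumption: std threads stand in for the tokio tasks the PR describes.
fn serialize_parallel(batches: &[Batch]) -> Vec<u8> {
    thread::scope(|s| {
        // Spawn one worker per batch; the handles stay in input order.
        let handles: Vec<_> = batches
            .iter()
            .map(|b| s.spawn(move || serialize_batch(b)))
            .collect();
        // Joining in spawn order keeps the file contents deterministic.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("serializer thread panicked"))
            .collect()
    })
}
```

   Calling `serialize_parallel(&batches)` should yield the same bytes as 
serializing the batches one after another, because no batch's output depends 
on any other's.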
   
   There is likely a tradeoff between write speed and memory utilization. If 
the ObjectStore writer cannot keep up with the data being serialized, bytes 
could accumulate in memory. ObjectStore puts are concurrent but not 
parallelized, so the risk of higher memory usage grows with the number of 
cores in the system.
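   
   One common way to cap that accumulation is a bounded queue between the 
serializers and the writer, so a slow writer throttles the producers. The 
sketch below is only an illustration of that idea and is not a description 
of what this PR or DataFusion does. It uses `std::sync::mpsc::sync_channel` 
rather than tokio, and `write_with_backpressure` and the byte-counting writer 
thread are hypothetical stand-ins for the real ObjectStore writer.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Send already-serialized buffers to a single writer thread.
/// `capacity` bounds how many buffers may wait in memory at once.
/// Returns the total number of bytes the writer received.
fn write_with_backpressure(chunks: Vec<Vec<u8>>, capacity: usize) -> usize {
    // `send` blocks once `capacity` buffers are queued, so a slow writer
    // throttles the producer instead of letting bytes pile up.
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    // Hypothetical stand-in for the ObjectStore writer: it only counts bytes.
    let writer = thread::spawn(move || {
        let mut written = 0;
        for buf in rx {
            written += buf.len();
        }
        written
    });

    for chunk in chunks {
        tx.send(chunk).expect("writer hung up");
    }
    // Close the channel so the writer loop ends.
    drop(tx);

    writer.join().expect("writer thread panicked")
}
```

   A smaller `capacity` lowers peak memory at the cost of stalling the 
serializers more often, which is the same speed-versus-memory tradeoff 
described above.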
   
   ## What changes are included in this PR?
   
   Spawn a tokio task to serialize each record batch.
   
   TODOs
   
   - [ ] Test error handling after these changes, ensure writes are still atomic
   - [ ] Benchmark write speed and memory consumption for CSV/JSON in this PR 
vs main 
   
   ## Are these changes tested?
   
   Yes, by existing tests, but more tests are needed to verify abort behavior 
and performance.
   
   ## Are there any user-facing changes?
   
   No, just faster writes!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

