devinjdangelo opened a new pull request, #7452: URL: https://github.com/apache/arrow-datafusion/pull/7452
## Which issue does this PR close?

Part of #7079

## Rationale for this change

Serialization of "stateless" file types (where the serialized bytes of each record batch have no dependency on the serialized bytes of any other record batch) can be parallelized efficiently across all available CPU cores, significantly decreasing the time needed to write out the file. This PR uses tokio task spawning to parallelize file serialization.

There is likely a tradeoff between write speed and memory utilization. If the ObjectStore writer cannot keep up with the data being serialized, bytes could accumulate in memory. ObjectStore puts are concurrent but not parallelized, so the risk of higher memory usage increases as the number of cores in the system increases.

## What changes are included in this PR?

Spawn a tokio task to serialize each record batch.

TODOs

- [ ] Test error handling after these changes, ensure writes are still atomic
- [ ] Benchmark write speed and memory consumption for CSV/JSON in this PR vs main

## Are these changes tested?

Yes, by existing tests, but more tests are needed to verify abort behavior and performance.

## Are there any user-facing changes?

No, just faster writes!
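The idea described above can be illustrated with a minimal, standard-library-only sketch. This is not the code in this PR: it swaps tokio tasks for `std::thread` workers so the example has no external dependencies, and the `Batch` type and the function names `serialize_csv` and `serialize_batches_parallel` are hypothetical stand-ins rather than DataFusion APIs. It only shows why a stateless format allows each batch to be serialized independently while the output order is preserved.

```rust
use std::thread;

/// Hypothetical stand-in for an Arrow record batch: rows of string fields.
type Batch = Vec<Vec<String>>;

/// Serialize one batch to CSV bytes. The output depends only on this batch,
/// which is what makes the format "stateless" in the sense used above.
fn serialize_csv(batch: &Batch) -> Vec<u8> {
    let mut out = Vec::new();
    for row in batch {
        out.extend_from_slice(row.join(",").as_bytes());
        out.push(b'\n');
    }
    out
}

/// Serialize every batch on its own worker, then concatenate the results
/// in the original batch order.
fn serialize_batches_parallel(batches: Vec<Batch>) -> Vec<u8> {
    // One worker per batch. Assumption: std threads stand in for the
    // tokio tasks that the PR description mentions.
    let handles: Vec<thread::JoinHandle<Vec<u8>>> = batches
        .into_iter()
        .map(|batch| thread::spawn(move || serialize_csv(&batch)))
        .collect();

    // Joining in spawn order keeps the file contents deterministic even
    // though the workers may finish in any order.
    let mut file_bytes = Vec::new();
    for handle in handles {
        let bytes = handle.join().expect("serialization worker panicked");
        file_bytes.extend_from_slice(&bytes);
    }
    file_bytes
}
```

Calling `serialize_batches_parallel` with two batches yields the same bytes as serializing them one after another. Note that in this sketch every worker's output stays in memory until it is joined and appended, which mirrors the memory tradeoff raised in the rationale: if the consumer is slower than the serializers, finished buffers pile up.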
