[GitHub] [arrow-rs] devinjdangelo commented on issue #1718: Support encoding a single parquet file using multiple threads

via GitHub Mon, 25 Sep 2023 16:39:57 -0700


devinjdangelo commented on issue #1718:
URL: https://github.com/apache/arrow-rs/issues/1718#issuecomment-1734611359


   > Option 1 is likely the most tractable, ArrowWriter already encodes columns 
to separate memory regions and then stitches the encoded column chunks 
together. I could conceive doing something similar for a parallel writer.
   
   @tustvold your initial intuition was spot on! I reworked the DataFusion 
parallel Parquet writer to primarily use column-wise parallelization. It is 
around 20% faster and has 90% lower memory overhead than the previous attempt.
   
   PRs open with more details for this new approach:
   - https://github.com/apache/arrow-rs/pull/4859
   - https://github.com/apache/arrow-datafusion/pull/7655
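
For readers who want a picture of the idea in the quoted comment (encode each column into its own memory region, then stitch the encoded chunks together), here is a minimal sketch using only the Rust standard library. It is not the code from either PR and does not use the arrow-rs or DataFusion APIs: `encode_column` and `encode_columns_parallel` are hypothetical names, and the "encoding" is just little-endian bytes standing in for a real Parquet column encoder.

```rust
use std::thread;

/// Hypothetical stand-in for a real column encoder: writes each value as
/// little-endian bytes into a buffer owned by this column alone.
fn encode_column(values: &[i32]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(values.len() * 4);
    for v in values {
        buf.extend_from_slice(&v.to_le_bytes());
    }
    buf
}

/// Encodes every column on its own thread, then stitches the chunks together
/// in column order. Returns the combined bytes and each chunk's
/// (offset, length) within them.
fn encode_columns_parallel(columns: &[Vec<i32>]) -> (Vec<u8>, Vec<(usize, usize)>) {
    // Scoped threads may borrow `columns` directly, so nothing is cloned.
    let chunks: Vec<Vec<u8>> = thread::scope(|s| {
        let handles: Vec<_> = columns
            .iter()
            .map(|col| s.spawn(move || encode_column(col)))
            .collect();
        // Joining in spawn order keeps the chunks in column order.
        handles
            .into_iter()
            .map(|h| h.join().expect("column encoder panicked"))
            .collect()
    });

    // Sequential stitch step: append each chunk and record where it landed.
    let mut out = Vec::new();
    let mut offsets = Vec::with_capacity(chunks.len());
    for chunk in chunks {
        offsets.push((out.len(), chunk.len()));
        out.extend_from_slice(&chunk);
    }
    (out, offsets)
}
```

Calling `encode_columns_parallel(&columns)` on a slice of columns returns the stitched bytes plus one `(offset, length)` pair per column. Only the encoding runs in parallel; the final concatenation stays sequential, which keeps the output layout deterministic regardless of which thread finishes first.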


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
