[GitHub] [arrow] westonpace commented on issue #15220: Speed up Parquet Writing?

GitBox Wed, 11 Jan 2023 09:39:10 -0800


westonpace commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1379250418


   I would encourage you to experiment with row group sizes.  For example, in 
a quick experiment (writing ~50M rows), I see about a 1.5x slowdown in write 
time when using these prohibitively small row groups:
   
   ```
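   >>> import timeit
   >>> import pyarrow.parquet as pq
   >>> # `tab` is the ~50M-row pyarrow Table being written; each timing covers 5 writes
   >>> timeit.timeit(lambda: pq.write_table(tab, "/tmp/foo.parquet"), number=5)
   95.93511083399972
   >>> timeit.timeit(lambda: pq.write_table(tab, "/tmp/foo.parquet", row_group_size=37738), number=5)
   149.49872442699962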
   ```
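
To put the row group size above in perspective, here is a rough sketch of how many row groups it implies. It assumes the table has exactly 50,000,000 rows (the comment only says ~50M) and that every row group is filled to `row_group_size` except possibly the last; `row_group_count` is a hypothetical helper, not a pyarrow function.

```python
def row_group_count(num_rows: int, row_group_size: int) -> int:
    """Row groups needed when each one holds at most `row_group_size` rows."""
    # Integer ceiling division: a final, partially filled group still counts.
    return (num_rows + row_group_size - 1) // row_group_size


# 50_000_000 is an assumed stand-in for the "~50M rows" in the comment.
print(row_group_count(50_000_000, 37_738))
```

Under those assumptions a single file ends up with well over a thousand row groups, each with its own per-column metadata.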
   
   Beyond writing, a row group size that is too small is likely to have an 
even larger negative effect on read performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

