westonpace commented on issue #15220: URL: https://github.com/apache/arrow/issues/15220#issuecomment-1379250418
I would encourage you to experiment with row group sizes. For example, in a quick experiment (writing ~50M rows) I see about a 1.5x slowdown when writing with these prohibitively small row groups:

```
>>> timeit.timeit(lambda: pq.write_table(tab, "/tmp/foo.parquet"), number=5)
95.93511083399972
>>> timeit.timeit(lambda: pq.write_table(tab, "/tmp/foo.parquet", row_group_size=37738), number=5)
149.49872442699962
```

Beyond writing, too small a row group size is likely to have an even larger negative effect on read performance.
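To put the numbers above in perspective, here is a small back-of-the-envelope sketch using only the figures quoted in the comment. The row count is approximate ("~50M"), so the resulting row group count is an estimate, and the helper names `row_group_count` and `slowdown` are made up for this illustration; they are not part of pyarrow.

```python
import math

# Figures quoted in the comment; the row count is approximate ("~50M rows").
APPROX_ROWS = 50_000_000
SMALL_ROW_GROUP_SIZE = 37_738
DEFAULT_SECONDS = 95.93511083399972   # total for 5 writes, default row_group_size
SMALL_SECONDS = 149.49872442699962    # total for 5 writes, row_group_size=37738
WRITES_PER_TIMING = 5                 # the `number=5` argument passed to timeit


def row_group_count(num_rows, row_group_size):
    """Estimate how many row groups a file gets when each holds at most row_group_size rows."""
    return math.ceil(num_rows / row_group_size)


def slowdown(baseline_seconds, candidate_seconds):
    """Return how many times slower the candidate timing is than the baseline."""
    return candidate_seconds / baseline_seconds


groups = row_group_count(APPROX_ROWS, SMALL_ROW_GROUP_SIZE)
ratio = slowdown(DEFAULT_SECONDS, SMALL_SECONDS)

print(f"estimated row groups per file: {groups}")
print(f"write slowdown: {ratio:.2f}x")
print(
    "seconds per write: "
    f"{DEFAULT_SECONDS / WRITES_PER_TIMING:.1f} (default) vs "
    f"{SMALL_SECONDS / WRITES_PER_TIMING:.1f} (row_group_size={SMALL_ROW_GROUP_SIZE})"
)
```

With a row group size of 37738, a file of roughly 50M rows ends up with well over a thousand row groups, each carrying its own metadata. When experimenting with a real file, the actual count can be checked with `pq.ParquetFile(path).metadata.num_row_groups`.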
