westonpace commented on issue #15220: URL: https://github.com/apache/arrow/issues/15220#issuecomment-1377929093
That row group size is pretty small. Note that we default to 64Mi rows per group; I've found 1Mi rows per group to be reasonable. Much less than that and performance starts to take a hit because the data:metadata ratio starts to get skewed. Did you try with larger row groups (a sketch is below)? Though I can see that small groups are what you get from ClickHouse.

Otherwise I don't have any suggestions off the top of my head. If ClickHouse supports parallel column encoding, that might explain it. One way to test that would be to split the table into N different pieces and write the pieces into N different files in parallel (see the second sketch below).

I have not done any profiling of the pyarrow JSON reader. It's entirely possible all the missing seconds are in the JSON reader.
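For reference, a minimal sketch of the larger-row-group experiment using `pyarrow.parquet.write_table`'s `row_group_size` parameter. The table and output path here are hypothetical stand-ins; substitute your own:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in for the real table; substitute your own data.
table = pa.table({"x": range(10_000_000)})

# row_group_size caps the number of rows per row group. The pyarrow
# default at the time was 64Mi rows; ~1Mi is a reasonable middle ground.
pq.write_table(table, "out.parquet", row_group_size=1024 * 1024)
```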

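And a sketch of the parallel-write test: split the table into N zero-copy slices and write them from a thread pool. The heavy Parquet encoding happens in C++ and should let the threads overlap; if the GIL turns out to be the bottleneck, a process pool is the fallback. `N`, the piece file names, and the table are all hypothetical choices:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical stand-in table; substitute your own data.
table = pa.table({"x": range(10_000_000)})

N = 8  # number of pieces/files; arbitrary choice
chunk = (table.num_rows + N - 1) // N

def write_piece(i: int) -> None:
    # slice() is zero-copy; a slice extending past the end is clamped.
    piece = table.slice(i * chunk, chunk)
    pq.write_table(piece, f"piece-{i}.parquet", row_group_size=1024 * 1024)

with ThreadPoolExecutor(max_workers=N) as pool:
    # If the total wall time drops well below the single-file write,
    # parallel column/file encoding likely explains the gap.
    list(pool.map(write_piece, range(N)))
```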