westonpace commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1377929093

   That row group size is pretty small.  Note that we default to 64Mi rows per row group, and I've found 1Mi rows per group to be reasonable.  Much less than that and performance starts to take a hit, because the data-to-metadata ratio starts to get skewed.  Did you try with larger row groups?  Though I can see that small row groups are what you get with ClickHouse.
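   In pyarrow you can control this directly when writing.  A minimal sketch (the table and file name here are just placeholders):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Placeholder table; substitute your own data.
   table = pa.table({"x": list(range(4 * 1024 * 1024))})

   # row_group_size caps the number of rows per row group.
   # 1Mi rows per group has been a reasonable middle ground for me.
   pq.write_table(table, "data.parquet", row_group_size=1024 * 1024)
   ```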
   
   Otherwise I don't have any suggestions off the top of my head.  If ClickHouse supports parallel column encoding, that might explain it.  One way to test that would be to split the table into N pieces and write the pieces to N different files in parallel.
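   A rough sketch of that experiment, assuming an in-memory pyarrow Table (the file names are illustrative, and I'm assuming the Parquet writer releases the GIL during encoding so that threads can overlap):

   ```python
   from concurrent.futures import ThreadPoolExecutor

   import pyarrow as pa
   import pyarrow.parquet as pq

   def write_in_parallel(table: pa.Table, n: int) -> None:
       """Split `table` into n row-wise slices and write each slice to its own file."""
       chunk = -(-table.num_rows // n)  # ceiling division

       def write_piece(i: int) -> None:
           piece = table.slice(i * chunk, chunk)  # zero-copy slice
           pq.write_table(piece, f"piece-{i}.parquet")

       with ThreadPoolExecutor(max_workers=n) as pool:
           list(pool.map(write_piece, range(n)))
   ```

   If the total wall time drops substantially with N threads, parallel encoding is likely where the difference comes from.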
   
   I have not done any profiling of the pyarrow JSON reader.  It's entirely possible that all of the missing time is spent in the JSON reader.
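   One quick way to check, a sketch assuming newline-delimited JSON input (the file names are placeholders), is to time the two stages separately:

   ```python
   import time

   import pyarrow.json as pj
   import pyarrow.parquet as pq

   start = time.perf_counter()
   table = pj.read_json("input.json")  # parse newline-delimited JSON
   read_s = time.perf_counter() - start

   start = time.perf_counter()
   pq.write_table(table, "output.parquet")
   write_s = time.perf_counter() - start

   print(f"JSON read: {read_s:.1f}s, Parquet write: {write_s:.1f}s")
   ```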

