marklit commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1376871039

   I ran the following on my MBP this morning. PyArrow's combined wall time
(JSON read plus Parquet write) came within 1.38x of ClickHouse's, which is a
pretty good speed-up.
   
   ```bash
   $ time \
       ~/Downloads/ch/clickhouse local \
             --input-format JSONEachRow \
             -q "SELECT *
                 FROM table
                 FORMAT Parquet" \
       < California.jsonl \
       > ch.snappy.pq
   ```
   
   The above took 36.316 seconds.
   
   
   ```python
   import pyarrow.parquet
   import pyarrow.json
   
   In [3]: %time table = pyarrow.json.read_json('California.jsonl')
   CPU times: user 26.6 s, sys: 16.5 s, total: 43.1 s
   Wall time: 21.6 s
   
   In [4]: %time pyarrow.parquet.write_table(table, 'pyarrow.snappy.pq', row_group_size=37738)
   CPU times: user 8.98 s, sys: 8.37 s, total: 17.3 s
   Wall time: 28.6 s
   ```
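
As a sanity check on the 1.38x figure, here is a minimal sketch that adds the two PyArrow wall times and divides by the ClickHouse time. It assumes the read and write steps are simply summed and that wall time is the right basis for comparison. The variable names are only illustrative.

```python
# Wall-clock timings copied from the runs above, in seconds.
clickhouse_wall = 36.316     # clickhouse local: JSONL -> Parquet
pyarrow_read_wall = 21.6     # pyarrow.json.read_json
pyarrow_write_wall = 28.6    # pyarrow.parquet.write_table

# Assumption: the PyArrow pipeline cost is the sum of both steps.
pyarrow_total = pyarrow_read_wall + pyarrow_write_wall
ratio = pyarrow_total / clickhouse_wall

print(f"PyArrow total: {pyarrow_total:.1f} s")
print(f"PyArrow / ClickHouse: {ratio:.2f}x")
```

Note that the read step alone accounts for a bit under half of the PyArrow total, so either step is a candidate for tuning.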
   
   Any suggestions on how I can improve on that number would be greatly 
appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
