marklit commented on issue #15220:
URL: https://github.com/apache/arrow/issues/15220#issuecomment-1376871039
I ran the following on my MBP this morning. PyArrow's total wall time came
within 1.38x of ClickHouse's, which is a pretty good speed-up.
```bash
$ time \
~/Downloads/ch/clickhouse local \
--input-format JSONEachRow \
-q "SELECT *
FROM table
FORMAT Parquet" \
< California.jsonl \
> ch.snappy.pq
```
The above took 36.316 seconds.
```python
import pyarrow.parquet
import pyarrow.json
In [3]: %time table = pyarrow.json.read_json('California.jsonl')
CPU times: user 26.6 s, sys: 16.5 s, total: 43.1 s
Wall time: 21.6 s
In [4]: %time pyarrow.parquet.write_table(table, 'pyarrow.snappy.pq', row_group_size=37738)
CPU times: user 8.98 s, sys: 8.37 s, total: 17.3 s
Wall time: 28.6 s
```
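For reference, the 1.38x figure appears to come from adding the two PyArrow wall times and dividing by the ClickHouse time. A minimal sketch of that arithmetic, using only the timings reported above (the variable names are illustrative):

```python
# Wall-clock timings copied from the runs above, in seconds.
clickhouse_wall_s = 36.316     # clickhouse local: JSONEachRow -> Parquet
pyarrow_read_wall_s = 21.6     # pyarrow.json.read_json
pyarrow_write_wall_s = 28.6    # pyarrow.parquet.write_table

# PyArrow does the conversion in two steps, so compare the sum.
pyarrow_total_s = pyarrow_read_wall_s + pyarrow_write_wall_s
ratio = pyarrow_total_s / clickhouse_wall_s

print(f"PyArrow total: {pyarrow_total_s:.1f} s, {ratio:.2f}x ClickHouse")
```

Note that the write step accounts for more of the total wall time than the read step.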
Any suggestions on how I can improve on that number would be greatly
appreciated.