marklit opened a new issue, #3441:
URL: https://github.com/apache/arrow-rs/issues/3441
Versions:
* json2parquet 0.6.0 with the following Cargo packages:
  * parquet = "29.0.0" (this is what is in the main branch, though the file metadata states 23.0.0 for some reason)
  * arrow = "29.0.0"
  * arrow-schema = { version = "29.0.0", features = ["serde"] }
* PyArrow 10.0.1
* ClickHouse 22.13.1.1119
I downloaded the California dataset from
https://github.com/microsoft/USBuildingFootprints and converted it from JSONL
into Parquet with both json2parquet and ClickHouse. json2parquet turned out to
be roughly 1.4x slower than ClickHouse (43.8 seconds vs. 32 seconds) at
converting the records into Snappy-compressed Parquet.
I first converted the original GeoJSON into JSONL with three fields per record.
The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.
```bash
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
```
```json
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
```
I then converted that file into Snappy-compressed Parquet with ClickHouse,
which took 32 seconds and produced a 793 MB file.
```bash
$ cat California.jsonl \
    | clickhouse local \
        --input-format JSONEachRow \
        -q "SELECT *
            FROM table
            FORMAT Parquet" \
    > cali.snappy.pq
```
json2parquet was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).
```bash
$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
    target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq
```
The above took 43.8 seconds to convert the JSONL into Parquet and produced an
815 MB file containing 12 row groups.
```python
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('California.snappy.pq')
In [3]: pf.schema
Out[3]:
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
optional binary field_id=-1 capture_dates_range (String);
optional binary field_id=-1 geom (String);
optional int64 field_id=-1 release;
}
In [4]: pf.metadata
Out[4]:
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
created_by: parquet-rs version 29.0.0
num_columns: 3
num_rows: 11542912
num_row_groups: 12
format_version: 1.0
serialized_size: 7969
```
The ClickHouse-produced Parquet file, by contrast, has 306 row groups.
```python
In [1]: pf = pq.ParquetFile('cali.snappy.pq')
In [2]: pf.schema
Out[2]:
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
optional int64 field_id=-1 release;
optional binary field_id=-1 capture_dates_range;
optional binary field_id=-1 geom;
}
In [3]: pf.metadata
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 11542912
num_row_groups: 306
format_version: 1.0
serialized_size: 228389
```
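To compare the two layouts side by side, the row groups can also be dumped with the parquet crate itself. A minimal sketch, assuming the ~29.0.0 reader API (point it at either file):
```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open either output file; the name here is just the json2parquet one.
    let file = File::open("California.snappy.pq")?;
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata();
    // Print the row count and compressed size of every row group.
    for i in 0..meta.num_row_groups() {
        let rg = meta.row_group(i);
        println!(
            "row group {i}: {} rows, {} bytes compressed",
            rg.num_rows(),
            rg.compressed_size()
        );
    }
    Ok(())
}
```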
I'm not sure whether the row-group sizes played into the performance delta.
Is there anything I can do to my compilation settings to speed up Parquet
generation?
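In case it is relevant to the row-group question above: as far as I can tell, the parquet crate exposes both the compression codec and the maximum row-group size through `WriterProperties`. A hedged sketch of what a writer could pass to `ArrowWriter` (the 64 Ki row-group size is an arbitrary value to benchmark, not a recommendation, and whether json2parquet plumbs this option through is an assumption on my part):
```rust
use std::fs::File;
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn write_snappy(
    path: &str,
    schema: SchemaRef,
    batches: &[RecordBatch],
) -> Result<(), Box<dyn std::error::Error>> {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        // The parquet-rs default is 1024 * 1024 rows per row group; a
        // smaller value, closer to ClickHouse's 306-group layout, may be
        // worth benchmarking (64 * 1024 here is a guess).
        .set_max_row_group_size(64 * 1024)
        .build();
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    // close() flushes the final row group and writes the footer.
    writer.close()?;
    Ok(())
}
```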
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]