marklit opened a new issue, #3441:
URL: https://github.com/apache/arrow-rs/issues/3441
Versions:
* json2parquet 0.6.0 with the following Cargo packages:
  * parquet = "29.0.0" (this is what is in the main branch, though the file metadata states 23.0.0 for some reason)
  * arrow = "29.0.0"
  * arrow-schema = { version = "29.0.0", features = ["serde"] }
* PyArrow 10.0.1
* ClickHouse 22.13.1.1119
I downloaded the California dataset from
https://github.com/microsoft/USBuildingFootprints and converted it from JSONL
into Parquet with both json2parquet and ClickHouse. json2parquet turned out to
be roughly 1.4x slower than ClickHouse (43.8 seconds vs. 32 seconds) at
converting the records into Snappy-compressed Parquet.
I first converted the original GeoJSON into JSONL with three fields per record.
The resulting JSONL file is 3 GB uncompressed and has 11,542,912 lines.
```bash
$ ogr2ogr -f GeoJSONSeq /vsistdout/ California.geojson \
    | jq -c '.properties * {geom: .geometry|tostring}' \
    > California.jsonl
$ head -n1 California.jsonl | jq .
```
```json
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
```
I then converted that file into Snappy-compressed Parquet with ClickHouse,
which took 32 seconds and produced a 793 MB file.
```bash
$ cat California.jsonl \
    | clickhouse local \
        --input-format JSONEachRow \
        -q "SELECT *
            FROM table
            FORMAT Parquet" \
    > cali.snappy.pq
```
json2parquet was compiled with rustc 1.66.0 (69f9c33d7 2022-12-12).
```bash
$ git clone https://github.com/domoritz/json2parquet/
$ cd json2parquet
$ RUSTFLAGS='-Ctarget-cpu=native' cargo build --release
$ /usr/bin/time -al \
    target/release/json2parquet \
        -c snappy \
        California.jsonl \
        California.snappy.pq
```
The above took 43.8 seconds to convert the JSONL into Parquet and produced an
815 MB file containing 12 row groups.
```python
In [1]: import pyarrow.parquet as pq
In [2]: pf = pq.ParquetFile('California.snappy.pq')
In [3]: pf.schema
Out[3]:
<pyarrow._parquet.ParquetSchema object at 0x109a11380>
required group field_id=-1 arrow_schema {
optional binary field_id=-1 capture_dates_range (String);
optional binary field_id=-1 geom (String);
optional int64 field_id=-1 release;
}
In [4]: pf.metadata
Out[4]:
<pyarrow._parquet.FileMetaData object at 0x10adf09f0>
created_by: parquet-rs version 29.0.0
num_columns: 3
num_rows: 11542912
num_row_groups: 12
format_version: 1.0
serialized_size: 7969
```
The ClickHouse-produced Parquet file, by contrast, has 306 row groups.
```python
In [1]: pf = pq.ParquetFile('cali.snappy.pq')
In [2]: pf.schema
Out[2]:
<pyarrow._parquet.ParquetSchema object at 0x105ccc940>
required group field_id=-1 schema {
optional int64 field_id=-1 release;
optional binary field_id=-1 capture_dates_range;
optional binary field_id=-1 geom;
}
In [3]: pf.metadata
Out[3]:
<pyarrow._parquet.FileMetaData object at 0x1076705e0>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 11542912
num_row_groups: 306
format_version: 1.0
serialized_size: 228389
```
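To compare the two layouts side by side, the row groups can also be dumped with the parquet crate itself. A minimal sketch, assuming the ~29.0.0 reader API (point it at either file):
```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open either output file; the name here is just the json2parquet one.
    let file = File::open("California.snappy.pq")?;
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata();
    // Print the row count and compressed size of every row group.
    for i in 0..meta.num_row_groups() {
        let rg = meta.row_group(i);
        println!(
            "row group {i}: {} rows, {} bytes compressed",
            rg.num_rows(),
            rg.compressed_size()
        );
    }
    Ok(())
}
```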
I'm not sure whether the row-group sizes played into the performance delta.
Is there anything I can do to my compilation settings to speed up Parquet
generation?
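In case it is relevant to the row-group question above: as far as I can tell, the parquet crate exposes both the compression codec and the maximum row-group size through `WriterProperties`. A hedged sketch of what a writer could pass to `ArrowWriter` (the 64 Ki row-group size is an arbitrary value to benchmark, not a recommendation, and whether json2parquet plumbs this option through is an assumption on my part):
```rust
use std::fs::File;
use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_writer::ArrowWriter;
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn write_snappy(
    path: &str,
    schema: SchemaRef,
    batches: &[RecordBatch],
) -> Result<(), Box<dyn std::error::Error>> {
    let props = WriterProperties::builder()
        .set_compression(Compression::SNAPPY)
        // The parquet-rs default is 1024 * 1024 rows per row group; a
        // smaller value, closer to ClickHouse's 306-group layout, may be
        // worth benchmarking (64 * 1024 here is a guess).
        .set_max_row_group_size(64 * 1024)
        .build();
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    // close() flushes the final row group and writes the footer.
    writer.close()?;
    Ok(())
}
```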
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]