marklit commented on issue #3441:
URL: https://github.com/apache/arrow-rs/issues/3441#issuecomment-1370686044

   The following was run against version 30.0.0 of the ``arrow`` and ``parquet`` crates.
   
   ```bash
   $ tail -n6 ~/json2parquet/Cargo.toml
   ```
   
   ```toml
   [dependencies]
   parquet = "30.0.0"
   arrow = "30.0.0"
   arrow-schema = { version = "30.0.0", features = ["serde"] }
   serde_json = "1.0.91"
   clap = { version = "4.0.32", features = ["derive"] }
   ```
   
   I ran this comparison again on a fresh 16-core ``e2-highcpu-16`` VM on GCP.
ClickHouse took 13.7 seconds and ``json2parquet`` took 56.7 seconds to process
the 11,542,912-row JSONL file. According to ``htop``, ``json2parquet`` was
maxing out a single core while ClickHouse managed to hit 40-180% across 4
cores.
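
For scale, the two timings can be turned into throughput figures. This is only a rough sketch of the arithmetic: the helper names ``rows_per_second`` and ``slowdown`` are made up for illustration and are not part of ``json2parquet`` or arrow-rs, and the inputs are just the row count and wall-clock times quoted above.

```rust
/// Rows processed per second of wall-clock time.
fn rows_per_second(rows: u64, seconds: f64) -> f64 {
    rows as f64 / seconds
}

/// How many times longer the slow run took than the fast run.
fn slowdown(slow_seconds: f64, fast_seconds: f64) -> f64 {
    slow_seconds / fast_seconds
}

fn main() {
    // Figures taken from the run described above.
    const ROWS: u64 = 11_542_912;
    const CLICKHOUSE_SECS: f64 = 13.7;
    const JSON2PARQUET_SECS: f64 = 56.7;

    println!("ClickHouse:   {:.0} rows/s", rows_per_second(ROWS, CLICKHOUSE_SECS));
    println!("json2parquet: {:.0} rows/s", rows_per_second(ROWS, JSON2PARQUET_SECS));
    println!("slowdown:     {:.2}x", slowdown(JSON2PARQUET_SECS, CLICKHOUSE_SECS));
}
```

That works out to ``json2parquet`` taking a little over four times as long as ClickHouse on this file.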
   
   Below is the flamegraph of ``json2parquet`` processing the 11,542,912-row 
JSONL file.
   
   ```bash
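   # Relax the kernel's perf restrictions (-1 is the most permissive setting).
   $ echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
   
   # Fetch the FlameGraph scripts.
   $ git clone https://github.com/brendangregg/FlameGraph ~/FlameGraph
   
   $ cd ~/json2parquet
   
   # Release build tuned for the host CPU.
   $ RUSTFLAGS='-Ctarget-cpu=native' \
           cargo build --release
   
   # Record call stacks (DWARF unwinding) while converting the JSONL file.
   $ sudo perf record \
       --call-graph dwarf \
       -- \
       target/release/json2parquet \
       -c snappy \
       ../California.jsonl \
       test.snappy.pq
   
   # Fold the recorded stacks, then render them as an SVG flamegraph.
   $ sudo perf script \
       | ~/FlameGraph/stackcollapse-perf.pl \
       > out.perf-folded
   $ ~/FlameGraph/flamegraph.pl \
       out.perf-folded \
       > perf.svg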
   ```
   
   
![perf](https://user-images.githubusercontent.com/359316/210525108-cd1f7521-3252-42c9-acff-3d57f00cac8e.svg)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
