Thank you for writing up your findings

If you use the `--mem-table` / `-m` flag, the CSV file is read once into an
in-memory table and the query iterations then run against that
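
For example, re-running query 1 with the flag added should move the CSV
parsing out of the measured iterations (command adapted from your message):

```shell
cd benchmarks
cargo run --release --bin tpch -- benchmark datafusion --iterations 3 \
    --path ./data --format tbl --query 1 --batch-size 4096 --mem-table
```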

As for better ways of profiling Rust, we have had good luck using `pprof`
[1] in InfluxDB IOx (which also uses DataFusion), so I have never really
tried to profile the tpch benchmark program directly

Making the profiling process easier / documenting it would definitely be
useful in my opinion

Andrew


[1] https://crates.io/crates/pprof
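
In case it is useful, here is a minimal sketch of how `pprof` can be wired
into a binary. It assumes the crate is added with its `flamegraph` feature
enabled, and `run_workload` is a hypothetical stand-in for whatever you want
to profile:

```rust
use std::fs::File;

fn main() {
    // Start sampling the current process at roughly 100 Hz.
    // Cargo.toml needs: pprof = { version = "*", features = ["flamegraph"] }
    let guard = pprof::ProfilerGuard::new(100).unwrap();

    // Placeholder for the actual benchmark queries.
    run_workload();

    // Resolve the collected samples in-process and write an SVG flamegraph,
    // avoiding the external addr2line step entirely.
    if let Ok(report) = guard.report().build() {
        let file = File::create("flamegraph.svg").unwrap();
        report.flamegraph(file).unwrap();
    }
}

fn run_workload() {
    // hypothetical workload
}
```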

On Fri, Mar 18, 2022 at 6:10 PM Bob Tinsman <bobti...@gmail.com> wrote:

> I've been diving into DataFusion benchmarking because I'm interested in
> understanding its performance. Here's a summary of my experience thus far.
> TL;DR: I was able to do it, but it's very slow, ironically.
> I'd love to hear about anyone else's experiences or recommendations for
> profiling DataFusion (or any other Rust projects, for that matter).
>
> I decided to start with the TPC-H benchmarks, which have support in the
> benchmarks directory [2], and use flamegraphs [1] to visualize CPU profile
> data. Gathering and preparing the profile data can be complicated, but
> there is a "flamegraph" cargo command [3] which conveniently wraps up the
> whole process.
>
> My steps:
>
> Followed the benchmark [2] instructions for generating TPC-H data
> Tested the DataFusion benchmark for query 1:
>
> cd benchmarks
>
> cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path 
> ./data --format tbl --query 1 --batch-size 4096
>
> This took about 4 seconds per iteration on my system (Ryzen 1700 with a
> pretty fast SSD).
>
> The flamegraph command uses the release profile by default, but you will
> need symbols, so add "debug = 1" under "[profile.release]" in the top-level
> Cargo.toml.
> I also did a top-level "cargo clean" to make sure I had symbols for
> everything.
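>
> The addition to Cargo.toml is just:
>
> ```toml
> [profile.release]
> debug = 1
> ```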
>
> To use flamegraph, just substitute "flamegraph" for "run" in the original
> command:
>
> cargo flamegraph --release --bin tpch -- benchmark datafusion --iterations 3 
> --path ./data --format tbl --query 1 --batch-size 4096
>
> I got the following output:
> Finished release [optimized + debuginfo] target(s) in 0.13s
> ...omitting various gripes about kernel symbols
> Running benchmarks with the following options: DataFusionBenchmarkOpt {
> query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096,
> path: "./data", file_format: "tbl", mem_table: false, output_path: None }
> Query 1 iteration 0 took 4106.1 ms and returned 4 rows
> Query 1 iteration 1 took 4025.6 ms and returned 4 rows
> Query 1 iteration 2 took 4048.3 ms and returned 4 rows
> Query 1 avg time: 4060.00 ms
> [ perf record: Woken up 591 times to write data ]
> [ perf record: Captured and wrote 149.619 MB perf.data (18567 samples) ]
>
> And then I waited a loooong time; I think I gave it up to 45 minutes. What
> was it doing? It looks like the flamegraph command was calling perf (the
> profiling command) which was then calling addr2line over and over:
> bob@core-beast:~/projects/arrow-datafusion/benchmarks$ ps -ft pts/1
> UID          PID    PPID  C STIME TTY          TIME CMD
> bob       348497  347690  0 Mar07 pts/1    00:00:01 -bash
> bob      2579267  348497  0 12:31 pts/1    00:00:00
> /home/bob/.cargo/bin/cargo-flamegraph flamegraph -o q1.svg --bin tpch --
> benchmark datafusion --iterations 3 --path ./data --format tbl --query 1
> --batch-size 4096
> bob      2579303 2579267  0 12:31 pts/1    00:00:00
> /usr/lib/linux-tools/5.4.0-100-generic/perf script
> bob      2580448 2579303  0 12:39 pts/1    00:00:00 sh -c addr2line -e
> /home/bob/.debug/.build-id/a7/de11851fe633c0abfe59affac522e35a752534/elf -i
> -f 000000000052c4a0
> bob      2580449 2580448  0 12:39 pts/1    00:00:00 addr2line -e
> /home/bob/.debug/.build-id/a7/de11851fe633c0abfe59affac522e35a752534/elf -i
> -f 000000000052c4a0
>
> I hit ctrl-C and the SVG file was written in spite of me interrupting it.
> I'm attaching it (q1.svg in the attached tar.gz), and it does show some
> interesting stuff--reading the CSV files actually takes up the most time.
> There is a lot of noise in the stack frames caused by Rust runtime
> machinery, so it can be hard to read.
>
> OK, that's nice, but it seems absurd for it to take that long. I googled
> this, and apparently perf uses a really inefficient way of mapping
> addresses; there is a better way but it's not under the right license [4].
> Also, it doesn't help that the "tpch" executable is 220M.
>
> I tried the workaround, which is downloading and compiling perf. This made
> it finish in a few seconds, but the symbols are all mangled (see
> q1-myperf.svg). Demangling is something "flamegraph" is supposed to handle,
> so something weird is happening.
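>
> One way to demangle manually might be to pipe the perf output through
> rustfilt (the Rust symbol demangler) before the FlameGraph scripts from
> [1], though I have not verified this against my locally built perf:
>
> ```
> perf script | rustfilt | ./stackcollapse-perf.pl | ./flamegraph.pl > q1-demangled.svg
> ```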
>
> Thanks if you've read this far...let me know what you think! Are there
> better ways of profiling Rust (my guess is I might actually have to pay for
> them)?
> Bob
>
> [1] https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
> [2] https://github.com/apache/arrow-datafusion/tree/master/benchmarks
> [3] https://github.com/flamegraph-rs/flamegraph
> [4] https://michcioperz.com/post/slow-perf-script/
>
>
