Andrew, thanks for your feedback! I started looking at IOx and pprof, and I'm slowly getting a better picture of DataFusion performance work. In particular, I can see that IOx is driving some of this (in particular [1]). I'm still in sponge mode, but I can think of a few useful things to do around benchmarking/profiling: - Parameterizing the TPC-H benchmark for scale factor - Parameterizing criterion benchmarks for data size, etc. - Instrument the TPC-H benchmark with pprof - Document profiling with pprof or other means (I realized that criterion has pprof integration already [2])
I'd welcome anybody's feedback. I have a couple questions as well: - Since pprof is an "internal" profiler (i.e. you write some code to integrate it), can you point me to how it's integrated in IOx? - Not really on topic, but can you give some advice on building faster? I am new to Rust and am not sure whether it's Rust or datafusion in particular. Part of it may be the combo of debug + optimized profile that I need for profiling; maybe it's time to upgrade my box (ryzen 1700/32G ram/m.2 ssd), but sometimes you can have a monster system and it's still slow [3]. Thanks, Bob [1] https://github.com/influxdata/influxdb_iox/issues/3994 [2] https://github.com/tikv/pprof-rs [3] https://fasterthanli.me/articles/why-is-my-rust-build-so-slow On Mon, Mar 21, 2022 at 7:27 AM Andrew Lamb <al...@influxdata.com> wrote: > Thank you for writing up your findings > > If you use the `--mem-table` / `-m` command, the CSV file is read once and > then the query is executed subsequently > > As for better ways of profiling rust, we have had good luck using `pprof` > [1] in InfluxDB IOx (which also uses DataFusion), so I have mostly never > tried to profile the tpch benchmark program directly > > Making the profiling process easier / documenting it would definitely be > useful in my opinion > > Andrew > > > [1] https://crates.io/crates/pprof > > On Fri, Mar 18, 2022 at 6:10 PM Bob Tinsman <bobti...@gmail.com> wrote: > > > I've been diving into DataFusion benchmarking because I'm interested in > > understanding its performance. Here's a summary of my experience thus > far. > > TL;DR: I was able to do it, but it's very slow, ironically. > > I'd love to hear about anyone else's experiences or recommendations > > profiling DataFusion (or any other Rust projects for that matter). > > > > I decided to start with the TPC-H benchmarks, which have support in the > > benchmarks directory [2], and use flamegraphs [1] to visualize CPU > profile > > data. Gathering and preparing the profile data can be complicated, but > > there is a "flamegraph" cargo command [3] which conveniently wraps up the > > whole process. > > > > My steps: > > > > Followed the benchmark [2] instructions for generating TPC-H data > > Tested the DataFusion benchmark for query 1: > > > > cd benchmarks > > > > cargo run --release --bin tpch -- benchmark datafusion --iterations 3 > --path ./data --format tbl --query 1 --batch-size 4096 > > > > This took about 4 seconds per iteration on my system (Ryzen 1700 with a > > pretty fast SSD). > > > > The flamegraph command uses release profile by default but you will need > > symbols, so add "debug = 1" under "[profile.release]" in the top level > > Cargo.toml. > > I also did top level "cargo clean" to make sure I had symbols for > > everything. > > > > To use flamegraph, just substitute "flamegraph" for "run" in the original > > command: > > > > cargo flamegraph --release --bin tpch -- benchmark datafusion > --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096 > > > > I got the following output: > > Finished release [optimized + debuginfo] target(s) in 0.13s > > ...omitting various gripes about kernel symbols > > Running benchmarks with the following options: DataFusionBenchmarkOpt { > > query: 1, debug: false, iterations: 3, partitions: 2, batch_size: 4096, > > path: "./data", file_format: "tbl", mem_table: false, output_path: None } > > Query 1 iteration 0 took 4106.1 ms and returned 4 rows > > Query 1 iteration 1 took 4025.6 ms and returned 4 rows > > Query 1 iteration 2 took 4048.3 ms and returned 4 rows > > Query 1 avg time: 4060.00 ms > > [ perf record: Woken up 591 times to write data ] > > [ perf record: Captured and wrote 149.619 MB perf.data (18567 samples) ] > > > > And then I waited a loooong time; I think I gave it up to 45 minutes. > What > > was it doing? It looks like the flamegraph command was calling perf (the > > profiling command) which was then calling addr2line over and over: > > bob@core-beast:~/projects/arrow-datafusion/benchmarks$ ps -ftpts/1 > > UID PID PPID C STIME TTY TIME CMD > > bob 348497 347690 0 Mar07 pts/1 00:00:01 -bash > > bob 2579267 348497 0 12:31 pts/1 00:00:00 > > /home/bob/.cargo/bin/cargo-flamegraph flamegraph -o q1.svg --bin tpch -- > > benchmark datafusion --iterations 3 --path ./data --format tbl --query 1 > > --batch-size 4096 > > bob 2579303 2579267 0 12:31 pts/1 00:00:00 > > /usr/lib/linux-tools/5.4.0-100-generic/perf script > > bob 2580448 2579303 0 12:39 pts/1 00:00:00 sh -c addr2line -e > > /home/bob/.debug/.build-id/a7/de11851fe633c0abfe59affac522e35a752534/elf > -i > > -f 000000000052c4a0 > > bob 2580449 2580448 0 12:39 pts/1 00:00:00 addr2line -e > > /home/bob/.debug/.build-id/a7/de11851fe633c0abfe59affac522e35a752534/elf > -i > > -f 000000000052c4a0 > > > > I hit ctrl-C and the SVG file was written in spite of me interrupting it. > > I'm attaching it (q1.svg in the attached tar.gz), and it does show some > > interesting stuff--reading the CSV files actually takes up the most time. > > There is a lot of noise in the stack frames caused by Rust runtime > > machinery so it can be hard to read. > > > > OK, that's nice, but it seems absurd for it to take that long. I googled > > this, and apparently perf uses a really inefficient way of mapping > > addresses; there is a better way but it's not under the right license > [4]. > > Also, it doesn't help that the "tpch" executable is 220M. > > > > I tried the workaround, which is downloading and compiling perf. This > made > > it work in a few seconds but the symbols are all mangled (see > > q1-myperf.svg). This is what "flamegraph" is supposed to do so something > > weird is happening. > > > > Thanks if you've read this far...let me know what you think! Are there > > better ways of profiling Rust (my guess is I might actually have to pay > for > > them)? > > Bob > > > > [1] https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html > > [2] https://github.com/apache/arrow-datafusion/tree/master/benchmarks > > [3] https://github.com/flamegraph-rs/flamegraph > > [4] https://michcioperz.com/post/slow-perf-script/ > > > > >