Re: [Rust][DataFusion] profiling TPC-H benchmarks with flamegraphs

Andrew Lamb Wed, 06 Apr 2022 13:18:18 -0700

I am going to try and make profiling in datafusion a bit easier. Feel free
to comment on the ticket[1]


Andrew

[1] https://github.com/apache/arrow-datafusion/issues/2174



On Mon, Mar 21, 2022 at 7:39 PM Bob Tinsman <[email protected]> wrote:

> Wow, hotspot looks very cool! However, I was only able to download v1.1
> which is very slow at processing the perf.data; it took something like 30
> minutes for a perf.data file of about 0.5G. I am resorting to building Qt
> from source, so I can build hotspot from source...😖
>
> On Mon, Mar 21, 2022 at 2:05 PM Daniël Heres <[email protected]>
> wrote:
>
> > Hi Bob,
> >
> > One command I've been using succesfully some time for profiling is as
> > follows (with different flags, csv file works just as well):
> >
> > perf record --call-graph=dwarf ../target/release/tpch benchmark
> datafusion
> > --path [path] --format parquet --query 6 --iterations 1 -n 16
> > And using hotspot (https://github.com/KDAB/hotspot) to load/visualize
> the
> > perf.data file (this takes 1-30s for me depending on the size of the
> file).
> >
> > Best regards,
> >
> > Daniël
> >
> > Op ma 21 mrt. 2022 om 21:12 schreef Bob Tinsman <[email protected]>:
> >
> > > Andrew, thanks for your feedback! I started looking at IOx and pprof,
> and
> > > I'm slowly getting a better picture of DataFusion performance work. In
> > > particular, I can see that IOx is driving some of this (in particular
> > [1]).
> > > I'm still in sponge mode, but I can think of a few useful things to do
> > > around benchmarking/profiling:
> > > - Parameterizing the TPC-H benchmark for scale factor
> > > - Parameterizing criterion benchmarks for data size, etc.
> > > - Instrument the TPC-H benchmark with pprof
> > > - Document profiling with pprof or other means (I realized that
> criterion
> > > has pprof integration already [2])
> > >
> > > I'd welcome anybody's feedback.
> > >
> > > I have a couple questions as well:
> > > - Since pprof is an "internal" profiler (i.e. you write some code to
> > > integrate it), can you point me to how it's integrated in IOx?
> > > - Not really on topic, but can you give some advice on building
> faster? I
> > > am new to Rust and am not sure whether it's Rust or datafusion in
> > > particular. Part of it may be the combo of debug + optimized profile
> > that I
> > > need for profiling; maybe it's time to upgrade my box (ryzen 1700/32G
> > > ram/m.2 ssd), but sometimes you can have a monster system and it's
> still
> > > slow [3].
> > >
> > > Thanks, Bob
> > >
> > >
> > > [1] https://github.com/influxdata/influxdb_iox/issues/3994
> > > [2] https://github.com/tikv/pprof-rs
> > > [3] https://fasterthanli.me/articles/why-is-my-rust-build-so-slow
> > >
> > > On Mon, Mar 21, 2022 at 7:27 AM Andrew Lamb <[email protected]>
> > wrote:
> > >
> > > > Thank you for writing up your findings
> > > >
> > > > If you use the `--mem-table` / `-m` command, the CSV file is read
> once
> > > and
> > > > then the query is executed subsequently
> > > >
> > > > As for better ways of profiling rust, we have had good luck using
> > `pprof`
> > > > [1] in InfluxDB IOx (which also uses DataFusion), so I have mostly
> > never
> > > > tried to profile the tpch benchmark program directly
> > > >
> > > > Making the profiling process easier / documenting it would definitely
> > be
> > > > useful in my opinion
> > > >
> > > > Andrew
> > > >
> > > >
> > > > [1] https://crates.io/crates/pprof
> > > >
> > > > On Fri, Mar 18, 2022 at 6:10 PM Bob Tinsman <[email protected]>
> > wrote:
> > > >
> > > > > I've been diving into DataFusion benchmarking because I'm
> interested
> > in
> > > > > understanding its performance. Here's a summary of my experience
> thus
> > > > far.
> > > > > TL;DR: I was able to do it, but it's very slow, ironically.
> > > > > I'd love to hear about anyone else's experiences or recommendations
> > > > > profiling DataFusion (or any other Rust projects for that matter).
> > > > >
> > > > > I decided to start with the TPC-H benchmarks, which have support in
> > the
> > > > > benchmarks directory [2], and use flamegraphs [1] to visualize CPU
> > > > profile
> > > > > data. Gathering and preparing the profile data can be complicated,
> > but
> > > > > there is a "flamegraph" cargo command [3] which conveniently wraps
> up
> > > the
> > > > > whole process.
> > > > >
> > > > > My steps:
> > > > >
> > > > > Followed the benchmark [2] instructions for generating TPC-H data
> > > > > Tested the DataFusion benchmark for query 1:
> > > > >
> > > > > cd benchmarks
> > > > >
> > > > > cargo run --release --bin tpch -- benchmark datafusion
> --iterations 3
> > > > --path ./data --format tbl --query 1 --batch-size 4096
> > > > >
> > > > > This took about 4 seconds per iteration on my system (Ryzen 1700
> > with a
> > > > > pretty fast SSD).
> > > > >
> > > > > The flamegraph command uses release profile by default but you will
> > > need
> > > > > symbols, so add "debug = 1" under "[profile.release]" in the top
> > level
> > > > > Cargo.toml.
> > > > > I also did top level "cargo clean" to make sure I had symbols for
> > > > > everything.
> > > > >
> > > > > To use flamegraph, just substitute "flamegraph" for "run" in the
> > > original
> > > > > command:
> > > > >
> > > > > cargo flamegraph --release --bin tpch -- benchmark datafusion
> > > > --iterations 3 --path ./data --format tbl --query 1 --batch-size 4096
> > > > >
> > > > > I got the following output:
> > > > > Finished release [optimized + debuginfo] target(s) in 0.13s
> > > > > ...omitting various gripes about kernel symbols
> > > > > Running benchmarks with the following options:
> > DataFusionBenchmarkOpt {
> > > > > query: 1, debug: false, iterations: 3, partitions: 2, batch_size:
> > 4096,
> > > > > path: "./data", file_format: "tbl", mem_table: false, output_path:
> > > None }
> > > > > Query 1 iteration 0 took 4106.1 ms and returned 4 rows
> > > > > Query 1 iteration 1 took 4025.6 ms and returned 4 rows
> > > > > Query 1 iteration 2 took 4048.3 ms and returned 4 rows
> > > > > Query 1 avg time: 4060.00 ms
> > > > > [ perf record: Woken up 591 times to write data ]
> > > > > [ perf record: Captured and wrote 149.619 MB perf.data (18567
> > samples)
> > > ]
> > > > >
> > > > > And then I waited a loooong time; I think I gave it up to 45
> minutes.
> > > > What
> > > > > was it doing? It looks like the flamegraph command was calling perf
> > > (the
> > > > > profiling command) which was then calling addr2line over and over:
> > > > > bob@core-beast:~/projects/arrow-datafusion/benchmarks$ ps -ftpts/1
> > > > > UID          PID    PPID  C STIME TTY          TIME CMD
> > > > > bob       348497  347690  0 Mar07 pts/1    00:00:01 -bash
> > > > > bob      2579267  348497  0 12:31 pts/1    00:00:00
> > > > > /home/bob/.cargo/bin/cargo-flamegraph flamegraph -o q1.svg --bin
> tpch
> > > --
> > > > > benchmark datafusion --iterations 3 --path ./data --format tbl
> > --query
> > > 1
> > > > > --batch-size 4096
> > > > > bob      2579303 2579267  0 12:31 pts/1    00:00:00
> > > > > /usr/lib/linux-tools/5.4.0-100-generic/perf script
> > > > > bob      2580448 2579303  0 12:39 pts/1    00:00:00 sh -c addr2line
> > -e
> > > > >
> > >
> /home/bob/.debug/.build-id/a7/de11851fe633c0abfe59affac522e35a752534/elf
> > > > -i
> > > > > -f 000000000052c4a0
> > > > > bob      2580449 2580448  0 12:39 pts/1    00:00:00 addr2line -e
> > > > >
> > >
> /home/bob/.debug/.build-id/a7/de11851fe633c0abfe59affac522e35a752534/elf
> > > > -i
> > > > > -f 000000000052c4a0
> > > > >
> > > > > I hit ctrl-C and the SVG file was written in spite of me
> interrupting
> > > it.
> > > > > I'm attaching it (q1.svg in the attached tar.gz), and it does show
> > some
> > > > > interesting stuff--reading the CSV files actually takes up the most
> > > time.
> > > > > There is a lot of noise in the stack frames caused by Rust runtime
> > > > > machinery so it can be hard to read.
> > > > >
> > > > > OK, that's nice, but it seems absurd for it to take that long. I
> > > googled
> > > > > this, and apparently perf uses a really inefficient way of mapping
> > > > > addresses; there is a better way but it's not under the right
> license
> > > > [4].
> > > > > Also, it doesn't help that the "tpch" executable is 220M.
> > > > >
> > > > > I tried the workaround, which is downloading and compiling perf.
> This
> > > > made
> > > > > it work in a few seconds but the symbols are all mangled (see
> > > > > q1-myperf.svg). This is what "flamegraph" is supposed to do so
> > > something
> > > > > weird is happening.
> > > > >
> > > > > Thanks if you've read this far...let me know what you think! Are
> > there
> > > > > better ways of profiling Rust (my guess is I might actually have to
> > pay
> > > > for
> > > > > them)?
> > > > > Bob
> > > > >
> > > > > [1] https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
> > > > > [2]
> > https://github.com/apache/arrow-datafusion/tree/master/benchmarks
> > > > > [3] https://github.com/flamegraph-rs/flamegraph
> > > > > [4] https://michcioperz.com/post/slow-perf-script/
> > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Daniël Heres
> >
>

Re: [Rust][DataFusion] profiling TPC-H benchmarks with flamegraphs

Reply via email to