Hi Robert,

This is great. It should be added to our Python documentation because users will likely need this!
After I installed gprof2dot I'm still prompted to install: "Please install
gprof2dot and dot for profile renderings."

Also, is there a way to run a pipeline unmodified with fn_api_runner? (For
those interested in profiling the SDK worker.) It works with the direct
runner, but "FnApiRunner" isn't currently supported as a --runner argument:

python -m apache_beam.examples.wordcount \
    --input=/etc/profile \
    --output=/tmp/py-wordcount-direct \
    *--runner=FnApiRunner* \
    --streaming \
    --profile_cpu --profile_location=./build/pyprofile

Thanks,
Thomas

On Mon, Nov 5, 2018 at 7:15 PM Ankur Goenka <[email protected]> wrote:

> All containers are destroyed by default on termination, so to analyze
> profiling data for portable runners, either disable container cleanup
> (using --retainDockerContainers=true) or use a remote distributed file
> system path.
>
> On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw <[email protected]>
> wrote:
>
>> Any portable runner should pick it up automatically.
>>
>> On Tue, Oct 30, 2018 at 3:32 AM Manu Zhang <[email protected]>
>> wrote:
>> >
>> > Cool! Can we document it somewhere such that other runners could pick
>> > it up later?
>> >
>> > Thanks,
>> > Manu Zhang
>> >
>> > On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <[email protected]>,
>> > wrote:
>> >
>> > This looks very helpful for debugging performance of portable pipelines.
>> > Great work!
>> >
>> > Enabling local directories for Flink or other portable runners would be
>> > useful for debugging, e.g. per
>> > https://issues.apache.org/jira/browse/BEAM-5440
>> >
>> > On 26.10.18 18:08, Robert Bradshaw wrote:
>> >
>> > Now that we've (mostly) moved from features to performance for
>> > Beam Python-on-Flink, I've been doing some profiling of Python code,
>> > and thought it may be useful for others as well (both those working on
>> > the SDK, and users who want to understand their own code), so I've
>> > tried to wrap this up into something useful.
>> >
>> > Python already had some existing profile options that we used with
>> > Dataflow, specifically --profile_cpu and --profile_location. I've
>> > hooked these up to both the DirectRunner and the SDK Harness Worker.
>> > One can now run commands like
>> >
>> > python -m apache_beam.examples.wordcount \
>> >     --output=counts.txt --profile_cpu --profile_location=path/to/directory
>> >
>> > and get nice graphs like the one attached. (Here the bulk of the time
>> > is spent reading from the default input in GCS. Another hint for
>> > reading the graph is that, due to fusion, the call graph is cyclic,
>> > passing through operations:86:receive for every output.)
>> >
>> > The raw Python profile stats [1] are produced in that directory, along
>> > with a dot graph and an SVG if both dot and gprof2dot [2] are installed.
>> > There is also an important option --direct_runner_bundle_repeat which
>> > can be set to gain more accurate profiles on smaller data sets by
>> > re-playing the bundle without the (non-trivial) one-time setup costs.
>> >
>> > These flags also work on portability runners such as Flink, where the
>> > directory must be set to a distributed filesystem. Each bundle
>> > produces its own profile in that directory, and they can be
>> > concatenated and manually fed into tools like the ones below. In that
>> > case there is a --profile_sample_rate which can be set to avoid
>> > profiling every single bundle (e.g. for a production job).
>> >
>> > The PR is up at https://github.com/apache/beam/pull/6847
>> > Hope it's useful.
>> >
>> > - Robert
>> >
>> > [1] https://docs.python.org/2/library/profile.html
>> > [2] https://github.com/jrfonseca/gprof2dot
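For readers who want to look at the raw output by hand: the sketch below is
one way to merge the per-bundle profile files and render them. It assumes
the files written under --profile_location are standard cProfile/pstats
dumps (as [1] suggests), that they have been copied to a local directory,
and that the directory contains nothing but those dumps; all paths here are
illustrative, not part of the SDK.

    import glob
    import pstats

    # Per-bundle profile files written under --profile_location
    # (directory name is illustrative).
    files = glob.glob('./build/pyprofile/*')

    # Merge them into a single Stats object and show the hot spots.
    stats = pstats.Stats(files[0])
    for f in files[1:]:
        stats.add(f)
    stats.sort_stats('cumulative').print_stats(20)

    # Dump the merged stats so they can be rendered with gprof2dot [2] and dot:
    #   gprof2dot -f pstats merged.pstats | dot -Tsvg -o profile.svg
    stats.dump_stats('merged.pstats')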

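And on Thomas's --runner=FnApiRunner question: one possible workaround,
rather than a supported flag, is to construct the runner in code and hand
it to the pipeline. This is an untested sketch; the FnApiRunner import path
and the wordcount steps are assumptions and may differ between Beam
versions.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    # Import path assumed; it has moved between Beam versions.
    from apache_beam.runners.portability.fn_api_runner import FnApiRunner

    options = PipelineOptions([
        '--profile_cpu',
        '--profile_location=./build/pyprofile',
    ])

    # Pass the runner instance directly instead of using --runner.
    with beam.Pipeline(runner=FnApiRunner(), options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('/etc/profile')
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'Count' >> beam.combiners.Count.PerElement()
         | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
         | 'Write' >> beam.io.WriteToText('/tmp/py-wordcount-direct'))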