Cool! Can we document it somewhere so that other Runners can pick it up later?
Thanks,
Manu Zhang

On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <m...@apache.org>, wrote:
> This looks very helpful for debugging performance of portable pipelines.
> Great work!
>
> Enabling local directories for Flink or other portable Runners would be
> useful for debugging, e.g. per
> https://issues.apache.org/jira/browse/BEAM-5440
>
> On 26.10.18 18:08, Robert Bradshaw wrote:
> > Now that we've (mostly) moved from features to performance for
> > BeamPython-on-Flink, I've been doing some profiling of Python code,
> > and thought it may be useful for others as well (both those working on
> > the SDK, and users who want to understand their own code), so I've
> > tried to wrap this up into something useful.
> >
> > Python already had some existing profile options that we used with
> > Dataflow, specifically --profile_cpu and --profile_location. I've
> > hooked these up to both the DirectRunner and the SDK Harness Worker.
> > One can now run commands like
> >
> >     python -m apache_beam.examples.wordcount \
> >         --output=counts.txt --profile_cpu \
> >         --profile_location=path/to/directory
> >
> > and get nice graphs like the one attached. (Here the bulk of the time
> > is spent reading from the default input in GCS. Another hint for
> > reading the graph is that due to fusion the call graph is cyclic,
> > passing through operations:86:receive for every output.)
> >
> > The raw Python profile stats [1] are produced in that directory, along
> > with a dot graph and an SVG if both dot and gprof2dot are installed.
> > There is also an important option, --direct_runner_bundle_repeat, which
> > can be set to gain more accurate profiles on smaller data sets by
> > re-playing the bundle without the (non-trivial) one-time setup costs.
> >
> > These flags also work on portability runners such as Flink, where the
> > directory must be set to a distributed filesystem. Each bundle
> > produces its own profile in that directory, and they can be
> > concatenated and manually fed into tools like those listed below. In
> > that case there is a --profile_sample_rate option, which can be set to
> > avoid profiling every single bundle (e.g. for a production job).
> >
> > The PR is up at https://github.com/apache/beam/pull/6847
> > Hope it's useful.
> >
> > - Robert
> >
> >
> > [1] https://docs.python.org/2/library/profile.html
> > [2] https://github.com/jrfonseca/gprof2dot
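
For anyone picking this up later, the "concatenate the per-bundle profiles" step described above could look roughly like the sketch below, using the standard-library pstats module. This is only an illustration and not part of the PR: merge_profiles and the bundle_*.prof file pattern are hypothetical names, and it assumes the per-bundle files have already been copied to a local directory and are in the format pstats can load.

```python
import pstats


def merge_profiles(paths):
    """Combine several Python profile dumps into one pstats.Stats object.

    `paths` is a list of local files in the format written by
    cProfile/profile dump_stats(). That the per-bundle files have this
    format and live locally is an assumption of this sketch.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("no profile files given")
    # pstats.Stats accepts several filenames and sums their statistics.
    return pstats.Stats(*paths)


# Hypothetical usage; the file pattern is made up for illustration:
#
#   import glob
#   merged = merge_profiles(glob.glob("path/to/directory/bundle_*.prof"))
#   merged.sort_stats("cumulative").print_stats(20)
#   merged.dump_stats("merged.prof")  # single file for tools like gprof2dot
```

Passing explicit paths instead of scanning the whole directory avoids trying to load the dot and SVG files that may sit next to the raw stats.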