Since it is for users, it should eventually go to the web site. How about a
new "Troubleshooting and Tuning" section under
https://beam.apache.org/documentation/sdks/python/ ?

On Fri, Nov 16, 2018 at 10:08 AM Ahmet Altay <al...@google.com> wrote:
>
>
> On Fri, Nov 16, 2018 at 2:12 AM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> One needs to ensure that gprof2dot is importable (i.e. installed via pip
>> into your Python environment).
>>
>> As for specifying the FnApiRunner via the runner argument, --runner can
>> take fully qualified names (if it's not in the short list of known
>> runners). However, the FnApiRunner is the DirectRunner for non-streaming
>> mode, so there's no need to specify it explicitly.
>>
>> Good point about adding this to the documentation. It's unclear where
>> best to put it...
>>
>
> How about in the wiki under Python tips? (
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips) From there
> it can later be converted to full user docs.
>
>
>>
>> On Thu, Nov 15, 2018 at 5:28 PM Thomas Weise <t...@apache.org> wrote:
>>
>>> Hi Robert,
>>>
>>> This is great. It should be added to our Python documentation because
>>> users will likely need this!
>>>
>>> After I installed gprof2dot I'm still prompted to install:
>>>
>>> "Please install gprof2dot and dot for profile renderings."
>>>
>>> Also is there a way to run a pipeline unmodified with fn_api_runner?
>>> (For those interested in profiling the SDK worker.)
>>>
>>> It works with the direct runner, but "FnApiRunner" isn't currently
>>> supported as --runner argument:
>>>
>>> python -m apache_beam.examples.wordcount \
>>>   --input=/etc/profile \
>>>   --output=/tmp/py-wordcount-direct \
>>>   *--runner=FnApiRunner* \
>>>   --streaming \
>>>   --profile_cpu --profile_location=./build/pyprofile
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Mon, Nov 5, 2018 at 7:15 PM Ankur Goenka <goe...@google.com> wrote:
>>>
>>>> All containers are destroyed by default on termination, so to analyze
>>>> profiling data for portable runners, either disable container cleanup
>>>> (using --retainDockerContainers=true) or use a remote distributed file
>>>> system path.
>>>>
>>>> On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> Any portable runner should pick it up automatically.
>>>>>
>>>>> On Tue, Oct 30, 2018 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Cool! Can we document it somewhere such that other Runners could
>>>>> > pick it up later?
>>>>> >
>>>>> > Thanks,
>>>>> > Manu Zhang
>>>>> > On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <m...@apache.org>,
>>>>> > wrote:
>>>>> >
>>>>> > This looks very helpful for debugging performance of portable
>>>>> > pipelines. Great work!
>>>>> >
>>>>> > Enabling local directories for Flink or other portable Runners would
>>>>> > be useful for debugging, e.g. per
>>>>> > https://issues.apache.org/jira/browse/BEAM-5440
>>>>> >
>>>>> > On 26.10.18 18:08, Robert Bradshaw wrote:
>>>>> >
>>>>> > Now that we've (mostly) moved from features to performance for
>>>>> > BeamPython-on-Flink, I've been doing some profiling of Python code,
>>>>> > and thought it may be useful for others as well (both those working
>>>>> > on the SDK, and users who want to understand their own code), so I've
>>>>> > tried to wrap this up into something useful.
>>>>> >
>>>>> > Python already had some existing profile options that we used with
>>>>> > Dataflow, specifically --profile_cpu and --profile_location. I've
>>>>> > hooked these up to both the DirectRunner and the SDK Harness Worker.
>>>>> > One can now run commands like
>>>>> >
>>>>> > python -m apache_beam.examples.wordcount \
>>>>> >   --output=counts.txt --profile_cpu --profile_location=path/to/directory
>>>>> >
>>>>> > and get nice graphs like the one attached. (Here the bulk of the time
>>>>> > is spent reading from the default input in GCS. Another hint for
>>>>> > reading the graph is that due to fusion the call graph is cyclic,
>>>>> > passing through operations:86:receive for every output.)
>>>>> >
>>>>> > The raw Python profile stats [1] are produced in that directory,
>>>>> > along with a dot graph and an svg if both dot and gprof2dot are
>>>>> > installed. There is also an important option
>>>>> > --direct_runner_bundle_repeat which can be set to gain more accurate
>>>>> > profiles on smaller data sets by re-playing the bundle without the
>>>>> > (non-trivial) one-time setup costs.
>>>>> >
>>>>> > These flags also work on portability runners such as Flink, where the
>>>>> > directory must be set to a distributed filesystem. Each bundle
>>>>> > produces its own profile in that directory, and they can be
>>>>> > concatenated and manually fed into tools like gprof2dot [2]. In that
>>>>> > case there is a --profile_sample_rate which can be set to avoid
>>>>> > profiling every single bundle (e.g. for a production job).
>>>>> >
>>>>> > The PR is up at https://github.com/apache/beam/pull/6847 Hope it's
>>>>> > useful.
>>>>> >
>>>>> > - Robert
>>>>> >
>>>>> >
>>>>> > [1] https://docs.python.org/2/library/profile.html
>>>>> > [2] https://github.com/jrfonseca/gprof2dot
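
For anyone wanting to combine the per-bundle profiles mentioned in the
original message, here is a rough sketch of one way to do it with Python's
standard pstats module: load one raw stats file and fold the others in with
Stats.add(). The names merge_profiles, busy_work, write_sample_profile and
the bundle-N.prof file names are made up for illustration. The thread does
not say what file names Beam writes under --profile_location, so the demo
generates its own small profiles with cProfile instead of assuming them.

```python
import cProfile
import os
import pstats
import tempfile


def merge_profiles(paths):
    """Fold several raw profile stats files into a single pstats.Stats."""
    paths = list(paths)
    if not paths:
        raise ValueError("no profile files given")
    stats = pstats.Stats(paths[0])
    for path in paths[1:]:
        # Stats.add() accumulates call counts and timings per function.
        stats.add(path)
    return stats


def busy_work(n):
    """Stand-in for the code being profiled (hypothetical)."""
    return sum(i * i for i in range(n))


def write_sample_profile(path, n):
    """Profile one call to busy_work and dump the raw stats to path."""
    profiler = cProfile.Profile()
    profiler.enable()
    busy_work(n)
    profiler.disable()
    profiler.dump_stats(path)


# Demo: two fake "bundle" profiles in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for index in range(2):
    path = os.path.join(demo_dir, "bundle-%d.prof" % index)
    write_sample_profile(path, 10000)
    demo_paths.append(path)

merged = merge_profiles(demo_paths)
# Show the five most expensive entries by cumulative time.
merged.sort_stats("cumulative").print_stats(5)
```

With real profiles you would pass the stats files found in your
--profile_location directory instead of demo_paths. The merged result can be
written back out with merged.dump_stats(some_path) and then rendered with
gprof2dot, which accepts this format through its -f pstats option.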