Re: Python profiling

Thomas Weise Fri, 16 Nov 2018 10:13:07 -0800

Since it is for users, it should eventually go to the web site.

How about a new section under:
https://beam.apache.org/documentation/sdks/python/


"Troubleshooting and Tuning" ?


On Fri, Nov 16, 2018 at 10:08 AM Ahmet Altay <al...@google.com> wrote:

>
>
> On Fri, Nov 16, 2018 at 2:12 AM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> One needs to ensure that gprof2dot is importable (i.e. installed via pip
>> into your Python environment).
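A quick way to see which of the two rendering prerequisites is missing is to ask the same interpreter that runs the pipeline. This is only a sketch, assuming Python 3; the function name rendering_tools_status is made up for illustration and is not part of Beam.

```python
import importlib.util
import shutil


def rendering_tools_status():
    """Report whether gprof2dot is importable and dot is on PATH.

    Hypothetical helper for illustration; not a Beam API.
    """
    return {
        # find_spec returns None when this interpreter cannot locate the module.
        "gprof2dot": importlib.util.find_spec("gprof2dot") is not None,
        # shutil.which returns None when no such executable is found on PATH.
        "dot": shutil.which("dot") is not None,
    }


if __name__ == "__main__":
    for tool, found in rendering_tools_status().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

If gprof2dot shows as missing here even though pip reported success, it was probably installed into a different Python environment than the one used to launch the pipeline.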
>>
>> As for specifying the FnApiRunner via the runner argument, --runner can
>> take fully qualified names (if it's not in the short list of known
>> runners). However, the FnApiRunner is the DirectRunner for non-streaming
>> mode, so there's no need to specify it explicitly.
>>
>> Good point about adding this to the documentation. It's unclear where
>> best to put it...
>>
>
> How about in the wiki under Python Tips? (
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips) From there
> it can later be converted to full user docs.
>
>
>>
>> On Thu, Nov 15, 2018 at 5:28 PM Thomas Weise <t...@apache.org> wrote:
>>
>>> Hi Robert,
>>>
>>> This is great. It should be added to our Python documentation because
>>> users will likely need this!
>>>
>>> After I installed gprof2dot I'm still prompted to install:
>>>
>>> "Please install gprof2dot and dot for profile renderings."
>>>
>>> Also is there a way to run a pipeline unmodified with fn_api_runner?
>>> (For those interested in profiling the SDK worker.)
>>>
>>> It works with direct runner, but "FnApiRunner" isn't currently supported
>>> as --runner argument:
>>>
>>> python -m apache_beam.examples.wordcount \
>>>   --input=/etc/profile \
>>>   --output=/tmp/py-wordcount-direct \
>>>   *--runner=FnApiRunner* \
>>>   --streaming \
>>>   --profile_cpu --profile_location=./build/pyprofile
>>>
>>> Thanks,
>>> Thomas
>>>
>>>
>>> On Mon, Nov 5, 2018 at 7:15 PM Ankur Goenka <goe...@google.com> wrote:
>>>
>>>> All containers are destroyed by default on termination, so to analyze
>>>> profiling data for portable runners, either disable container cleanup
>>>> (using --retainDockerContainers=true) or use a remote distributed file
>>>> system path.
>>>>
>>>> On Mon, Nov 5, 2018 at 1:05 AM Robert Bradshaw <rober...@google.com>
>>>> wrote:
>>>>
>>>>> Any portable runner should pick it up automatically.
>>>>> On Tue, Oct 30, 2018 at 3:32 AM Manu Zhang <owenzhang1...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Cool! Can we document it somewhere so that other runners can pick
>>>>> > it up later?
>>>>> >
>>>>> > Thanks,
>>>>> > Manu Zhang
>>>>> > On Oct 29, 2018, 5:54 PM +0800, Maximilian Michels <m...@apache.org>
>>>>> > wrote:
>>>>> >
>>>>> > This looks very helpful for debugging performance of portable
>>>>> > pipelines.
>>>>> > Great work!
>>>>> >
>>>>> > Enabling local directories for Flink or other portable Runners would
>>>>> > be useful for debugging, e.g. per
>>>>> > https://issues.apache.org/jira/browse/BEAM-5440
>>>>> >
>>>>> > On 26.10.18 18:08, Robert Bradshaw wrote:
>>>>> >
>>>>> > Now that we've (mostly) moved from features to performance for
>>>>> > BeamPython-on-Flink, I've been doing some profiling of Python code,
>>>>> > and thought it may be useful for others as well (both those working
>>>>> > on the SDK, and users who want to understand their own code), so I've
>>>>> > tried to wrap this up into something useful.
>>>>> >
>>>>> > Python already had some existing profile options that we used with
>>>>> > Dataflow, specifically --profile_cpu and --profile_location. I've
>>>>> > hooked these up to both the DirectRunner and the SDK Harness Worker.
>>>>> > One can now run commands like
>>>>> >
>>>>> > python -m apache_beam.examples.wordcount \
>>>>> >   --output=counts.txt --profile_cpu --profile_location=path/to/directory
>>>>> >
>>>>> > and get nice graphs like the one attached. (Here the bulk of the time
>>>>> > is spent reading from the default input in GCS. Another hint for
>>>>> > reading the graph is that due to fusion the call graph is cyclic,
>>>>> > passing through operations:86:receive for every output.)
>>>>> >
>>>>> > The raw Python profile stats [1] are produced in that directory, along
>>>>> > with a dot graph and an SVG if both dot and gprof2dot are installed.
>>>>> > There is also an important option --direct_runner_bundle_repeat which
>>>>> > can be set to gain more accurate profiles on smaller data sets by
>>>>> > re-playing the bundle without the (non-trivial) one-time setup costs.
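For reading the raw stats without rendering a graph, the standard library's pstats module should be enough. A minimal sketch, assuming the files written under --profile_location load with pstats as the link in [1] suggests; the function name and any path passed to it are placeholders.

```python
import pstats


def print_top_functions(profile_path, limit=20):
    """Print the `limit` most expensive entries, sorted by cumulative time.

    `profile_path` is a placeholder for one stats file from the
    --profile_location directory.
    """
    stats = pstats.Stats(profile_path)
    # strip_dirs shortens file paths; "cumulative" sorts by time spent in a
    # function including everything it calls.
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return stats
```

Sorting by "tottime" instead of "cumulative" highlights functions that are expensive in their own right rather than through their callees.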
>>>>> >
>>>>> > These flags also work on portability runners such as Flink, where the
>>>>> > directory must be set to a distributed filesystem. Each bundle
>>>>> > produces its own profile in that directory, and they can be
>>>>> > concatenated and manually fed into tools like gprof2dot [2]. In that case
>>>>> > there is a --profile_sample_rate which can be set to avoid profiling
>>>>> > every single bundle (e.g. for a production job).
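One way to combine the per-bundle profiles is to let pstats aggregate them. This is a sketch under two assumptions: the per-bundle files have first been copied from the distributed filesystem to local disk, and each one is a pstats-compatible dump. merge_profiles and the file names are hypothetical.

```python
import pstats


def merge_profiles(profile_paths, output_path):
    """Aggregate several pstats dumps into one and write it to output_path.

    `profile_paths` is a list of local per-bundle stats files (assumed to
    have been copied locally beforehand).
    """
    if not profile_paths:
        raise ValueError("no profile files given")
    # Stats accepts several file names and sums the per-function counters.
    merged = pstats.Stats(*profile_paths)
    merged.dump_stats(output_path)
    return merged
```

The merged file can then be rendered with gprof2dot's pstats input format, e.g. gprof2dot -f pstats merged.prof | dot -Tsvg -o merged.svg (merged.prof being whatever output_path was used).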
>>>>> >
>>>>> > The PR is up at https://github.com/apache/beam/pull/6847
>>>>> > Hope it's useful.
>>>>> >
>>>>> > - Robert
>>>>> >
>>>>> >
>>>>> > [1] https://docs.python.org/2/library/profile.html
>>>>> > [2] https://github.com/jrfonseca/gprof2dot
>>>>> >
>>>>>
>>>>
>
