The state sampler is the only state provider for the Python SDK: the
Metrics module relies on it to attribute metrics to each step, and the
logging module likewise uses it to attribute logs to each step.
statesampler_slow does not implement the actual sampling, but it does
implement the state-provider functionality that those other modules rely on.

As for reducing the sampling frequency: the problem, in fact, is that the
lock needs to be acquired on every change of state[1], which is a very hot
path for the Python SDK. In statesampler_slow we avoid that cost by
tracking the current state on a stack backed by a Python list, since
appending to a list is a very efficient operation in Python[2].
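
To make that concrete, here is a rough sketch of the pattern (illustrative
only - the class and method names are made up, not the actual
statesampler_slow code): entering a step's state is a single list append
and exiting is a pop, so current_state() is just the top of the stack and
no lock is ever taken on the hot path.

from contextlib import contextmanager

class SlowStateSampler(object):
  """Sketch of a lock-free state provider backed by a Python list."""

  def __init__(self):
    self._state_stack = []  # top of the stack is the active state

  def current_state(self):
    # What the Metrics and logging modules query to attribute work to a step.
    return self._state_stack[-1] if self._state_stack else None

  @contextmanager
  def scoped_state(self, state):
    # Entering a state is one list append - cheap, and safe under the GIL
    # without an explicit lock.
    self._state_stack.append(state)
    try:
      yield
    finally:
      self._state_stack.pop()

A work item then runs inside "with sampler.scoped_state('step1-process'):",
and Metrics can call sampler.current_state() at any point to find the
active step.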

"But Pablo, how about the locks that the status-reporting thread needs to
use to check counters while the main thread executes a work item? Surely
this lock is necessary in Python and Cython!"
- The answer here is that we do not do any locking for the Python-only
implementation, because the status-reporting thread only reads data, and we
can handle a few inconsistencies (since final numbers are committed upon
work item completion).
- The sampling thread, on the other hand, writes data into the counters, so
it's necessary[3] to use locking.
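
To illustrate the split (a toy sketch with made-up names, not the Beam
implementation): the sampling thread reads the current state and adds the
elapsed msecs to that state's counter, so it must hold a lock, and every
state transition has to take the same lock - which is exactly the hot-path
cost from [1]. The status-reporting thread only reads, so it can go
lock-free.

import threading
import time

class SampledCounters(object):
  """Sketch: msec counters fed by a ~200ms sampling thread."""

  def __init__(self, period_secs=0.2):
    self._period = period_secs
    self._lock = threading.Lock()
    self._current_state = None
    self._msecs_by_state = {}
    self._stop = threading.Event()

  def set_state(self, state):
    # The hot-path cost: every state transition takes the sampler's lock.
    with self._lock:
      self._current_state = state

  def _sample_loop(self):
    last = time.time()
    while not self._stop.is_set():
      time.sleep(self._period)
      now = time.time()
      # Writer side: the sampling thread mutates the counters, so it holds
      # the lock to see a consistent current state while it writes.
      with self._lock:
        if self._current_state is not None:
          self._msecs_by_state[self._current_state] = (
              self._msecs_by_state.get(self._current_state, 0)
              + (now - last) * 1000.0)
      last = now

  def start(self):
    threading.Thread(target=self._sample_loop, daemon=True).start()

  def stop(self):
    self._stop.set()

  def snapshot(self):
    # Reader side (status-reporting thread): no lock; a slightly stale view
    # is fine because final values are committed at work-item completion.
    return dict(self._msecs_by_state)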

I do agree that we should do a better job of letting users know that their
msec metrics are not being collected.
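
For instance, one illustrative way to surface it at worker startup (a
sketch only - the FAST_SAMPLER flag and the warning text below are
assumptions mirroring the fast/slow module split, not necessarily Beam's
actual wiring):

import logging

# Hypothetical fallback between the Cython and pure-Python samplers.
try:
  from apache_beam.runners.worker import statesampler_fast  # noqa: F401
  FAST_SAMPLER = True
except ImportError:
  from apache_beam.runners.worker import statesampler_slow  # noqa: F401
  FAST_SAMPLER = False
  logging.warning(
      'Cython-optimized state sampler is unavailable: execution-time '
      '(msec) metrics will not be collected and will be reported as 0.')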

[1] https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_fast.pyx#L215-L220
[2] https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L53-L55
[3] https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_fast.pyx#L132-L133

On Mon, Jul 15, 2019 at 5:04 PM Alex Amato <ajam...@google.com> wrote:

> Perhaps no metric at all should be returned, instead of 0, which is an
> incorrect value.
>
> Also, is there a reason to have statesampler_slow at all then, if it's not
> intended to be implemented?
>
> On Mon, Jul 15, 2019 at 5:03 PM Kyle Weaver <kcwea...@google.com> wrote:
>
>> Pablo, what about setting a lower sampling rate? Or would that lead to
>> poor results?
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>> | +16502035555
>>
>>
>> On Mon, Jul 15, 2019 at 4:44 PM Pablo Estrada <pabl...@google.com> wrote:
>>
>>> @Thomas do you think this is a problem of documentation, or a missing
>>> feature?
>>>
>>> We did not add support for it without Cython because the cost of locking
>>> and checking every 200ms in Python would be too high - that's why this is
>>> only implemented in the optimized Cython codepath. I think it makes sense
>>> to document this, rather than adding the support, as it would be really
>>> expensive. What are your thoughts?
>>>
>>> Best
>>> -P.
>>>
>>> On Mon, Jul 15, 2019, 1:48 PM Thomas Weise <t...@apache.org> wrote:
>>>
>>>> That's great, but I think the JIRA needs to remain open since w/o
>>>> Cython the metric still doesn't work.
>>>>
>>>> It would however be helpful to add a comment regarding your findings.
>>>>
>>>>
>>>> On Mon, Jul 15, 2019 at 1:46 PM Rakesh Kumar <rakeshku...@lyft.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Installing Cython in the application environment fixed the issue. Now
>>>>> I am able to see the operator metrics
>>>>> ({organization_specific_prefix}.operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean).
>>>>>
>>>>> Thanks Ankur for looking into it and providing support.
>>>>>
>>>>> I am going to close https://issues.apache.org/jira/browse/BEAM-7058 if
>>>>> no one has any objections.
>>>>>
>>>>>
>>>>> On Thu, Apr 11, 2019 at 7:13 AM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> Tracked as https://issues.apache.org/jira/browse/BEAM-7058
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 10, 2019 at 11:38 AM Pablo Estrada <pabl...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This sounds like a bug then? +Alex Amato <ajam...@google.com>
>>>>>>>
>>>>>>> On Wed, Apr 10, 2019 at 3:59 AM Maximilian Michels <m...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi @all,
>>>>>>>>
>>>>>>>> From a quick debugging session, I conclude that the wiring is in
>>>>>>>> place for the Flink Runner. There is a ProgressReporter that reports
>>>>>>>> MonitoringInfos to Flink, in a similar fashion to the "legacy"
>>>>>>>> Runner.
>>>>>>>>
>>>>>>>> The bundle duration metrics are 0, but the element count gets
>>>>>>>> reported correctly. It appears to be an issue in the Python/Java
>>>>>>>> harness, because "ProcessBundleProgressResponse" contains only 0
>>>>>>>> values for the bundle duration.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> On 04.04.19 19:54, Mikhail Gryzykhin wrote:
>>>>>>>> > Hi everyone,
>>>>>>>> >
>>>>>>>> > Quick summary on Python and the Dataflow Runner:
>>>>>>>> > Python SDK already reports:
>>>>>>>> > - MSec
>>>>>>>> > - User metrics (int64 and distribution)
>>>>>>>> > - PCollection Element Count
>>>>>>>> > - Work on MeanByteCount for PCollection is ongoing here:
>>>>>>>> > <https://github.com/apache/beam/pull/8062>.
>>>>>>>> >
>>>>>>>> > Dataflow Runner:
>>>>>>>> > - all metrics listed above are passed through to Dataflow.
>>>>>>>> >
>>>>>>>> > Ryan can give more information on the Flink Runner. I also see
>>>>>>>> > Maximilian on some of the relevant PRs, so he might comment on
>>>>>>>> > this as well.
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Mikhail.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Thu, Apr 4, 2019 at 10:43 AM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>>> >
>>>>>>>> >     Hello guys!
>>>>>>>> >     Alex, Mikhail and Ryan are working on support for metrics in
>>>>>>>> >     the portability framework. The support on the SDK is pretty
>>>>>>>> >     advanced AFAIK*, and the next step is to get the metrics back
>>>>>>>> >     into the runner. Lukasz and myself are working on a project
>>>>>>>> >     that depends on this too, so I'm adding everyone so we can get
>>>>>>>> >     an idea of what's missing.
>>>>>>>> >
>>>>>>>> >     I believe:
>>>>>>>> >     - User metrics are fully wired up in the SDK
>>>>>>>> >     - State sampler (timing) metrics are wired up as well (is that
>>>>>>>> >     right, +Alex Amato <ajam...@google.com>?)
>>>>>>>> >     - Work is ongoing to send the updates back to Flink.
>>>>>>>> >     - What is the plan for making metrics queryable from Flink?
>>>>>>>> >     +Ryan Williams <r...@runsascoded.com>
>>>>>>>> >
>>>>>>>> >     Thanks!
>>>>>>>> >     -P.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >     On Wed, Apr 3, 2019 at 12:02 PM Thomas Weise <t...@apache.org> wrote:
>>>>>>>> >
>>>>>>>> >         I believe this is where the metrics are supplied:
>>>>>>>> >         https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/operations.py
>>>>>>>> >
>>>>>>>> >         git grep process_bundle_msecs yields results for the
>>>>>>>> >         Dataflow worker only
>>>>>>>> >
>>>>>>>> >         There isn't any test coverage for the Flink runner:
>>>>>>>> >         https://github.com/apache/beam/blob/d38645ae8758d834c3e819b715a66dd82c78f6d4/sdks/python/apache_beam/runners/portability/flink_runner_test.py#L181
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >         On Wed, Apr 3, 2019 at 10:45 AM Akshay Balwally
>>>>>>>> >         <abalwa...@lyft.com> wrote:
>>>>>>>> >
>>>>>>>> >             Should have added - I'm using the Python SDK, Flink runner
>>>>>>>> >
>>>>>>>> >             On Wed, Apr 3, 2019 at 10:32 AM Akshay Balwally
>>>>>>>> >             <abalwa...@lyft.com> wrote:
>>>>>>>> >
>>>>>>>> >                 Hi,
>>>>>>>> >                 I'm hoping to get metrics on the amount of time
>>>>>>>> >                 spent on each operator, so it seems like the stat
>>>>>>>> >
>>>>>>>> >                 {organization_specific_prefix}.operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean
>>>>>>>> >
>>>>>>>> >                 would be pretty helpful. But in practice, this stat
>>>>>>>> >                 always shows 0, which I interpret as 0 milliseconds
>>>>>>>> >                 spent per bundle, which can't be correct (other stats
>>>>>>>> >                 show that the operators are running, and timers within
>>>>>>>> >                 the operators show more reasonable times). Is this a
>>>>>>>> >                 known bug?
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >                 --
>>>>>>>> >                 *Akshay Balwally*
>>>>>>>> >                 Software Engineer
>>>>>>>> >                 937.271.6469
>>>>>>>> >                 Lyft <http://www.lyft.com/>
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >
>>>>>>>> >             --
>>>>>>>> >             *Akshay Balwally*
>>>>>>>> >             Software Engineer
>>>>>>>> >             937.271.6469
>>>>>>>> >             Lyft <http://www.lyft.com/>
>>>>>>>> >
>>>>>>>>
>>>>>>>
