The state sampler is the only state provider for the Python SDK: the Metrics module relies on it to attribute metrics to each step, and the logging module uses it in the same way to attribute logs. statesampler_slow does not implement the actual sampling, but it does implement the state-provider functionality that other modules rely on.
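The general shape of that state-provider role can be sketched like this (a hypothetical, simplified sketch with invented names, not the actual code in statesampler_slow.py): the sampler just maintains a stack of states, and metrics/logging read the top of the stack to know which step is currently executing.

```python
import contextlib


class SlowStateSampler:
    """State provider without any background sampling thread and
    without any locking. Hypothetical sketch, not the Beam code."""

    def __init__(self):
        self._state_stack = ['unknown']  # bottom sentinel state

    @property
    def current_state(self):
        # Metrics and logging read the top of the stack to attribute
        # work to the right step.
        return self._state_stack[-1]

    @contextlib.contextmanager
    def scoped_state(self, state_name):
        # Appending to / popping from a Python list is atomic under the
        # GIL and very cheap, which matters on this hot path.
        self._state_stack.append(state_name)
        try:
            yield
        finally:
            self._state_stack.pop()


sampler = SlowStateSampler()
with sampler.scoped_state('step1-process'):
    assert sampler.current_state == 'step1-process'
assert sampler.current_state == 'unknown'
```

Nothing here ever samples the stack on a timer; that is exactly the part that only exists in the Cython path.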
As for reducing the sampling frequency: the real problem is that a lock must be acquired on every change of state[1], which is a very hot path for the Python SDK. In statesampler_slow we instead track state by appending it to a stack implemented as a Python list (a very cheap operation in Python)[2].

"But Pablo, what about the locks that the status-reporting thread needs in order to check counters while the main thread executes a work item? Surely that lock is necessary in both Python and Cython!" - The answer is that we do no locking at all in the Python-only implementation, because the status-reporting thread only reads data, and we can tolerate a few inconsistencies (final numbers are committed upon work item completion). The sampling thread, on the other hand, writes data into the counters, so there locking is necessary[3].

I do agree that we should be better at letting the user know that their msec metrics are not being collected.

[1] https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_fast.pyx#L215-L220
[2] https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_slow.py#L53-L55
[3] https://github.com/pabloem/beam/blob/9cf64f57fcdea9a82360ed9b01a43ef25cdb991d/sdks/python/apache_beam/runners/worker/statesampler_fast.pyx#L132-L133

On Mon, Jul 15, 2019 at 5:04 PM Alex Amato <ajam...@google.com> wrote:

> Perhaps no metric at all should be returned, instead of 0, which is an
> incorrect value.
>
> Also, is there a reason to have state_sampler_slow at all, if it's not
> intended to be implemented?
>
> On Mon, Jul 15, 2019 at 5:03 PM Kyle Weaver <kcwea...@google.com> wrote:
>
>> Pablo, what about setting a lower sampling rate? Or would that lead to
>> poor results?
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>> | +16502035555
>>
>> On Mon, Jul 15, 2019 at 4:44 PM Pablo Estrada <pabl...@google.com> wrote:
>>
>>> @Thomas do you think this is a problem of documentation, or a missing
>>> feature?
>>>
>>> We did not add support for it without Cython because the cost of
>>> locking and checking every 200ms in pure Python would be too high -
>>> that's why this is only implemented in the optimized Cython codepath.
>>> I think it makes sense to document this rather than adding the
>>> support, as it would be really expensive. What are your thoughts?
>>>
>>> Best
>>> -P.
>>>
>>> On Mon, Jul 15, 2019, 1:48 PM Thomas Weise <t...@apache.org> wrote:
>>>
>>>> That's great, but I think the JIRA needs to remain open, since
>>>> without Cython the metric still doesn't work.
>>>>
>>>> It would however be helpful to add a comment regarding your findings.
>>>>
>>>> On Mon, Jul 15, 2019 at 1:46 PM Rakesh Kumar <rakeshku...@lyft.com> wrote:
>>>>
>>>>> Installing Cython in the application environment fixed the issue.
>>>>> Now I am able to see the operator metrics
>>>>> ({organization_specific_prefix}.operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean).
>>>>>
>>>>> Thanks Ankur for looking into it and providing support.
>>>>>
>>>>> I am going to close https://issues.apache.org/jira/browse/BEAM-7058
>>>>> if no one has any objection.
>>>>>
>>>>> On Thu, Apr 11, 2019 at 7:13 AM Thomas Weise <t...@apache.org> wrote:
>>>>>
>>>>>> Tracked as https://issues.apache.org/jira/browse/BEAM-7058
>>>>>>
>>>>>> On Wed, Apr 10, 2019 at 11:38 AM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>
>>>>>>> This sounds like a bug then?
>>>>>>> +Alex Amato <ajam...@google.com>
>>>>>>>
>>>>>>> On Wed, Apr 10, 2019 at 3:59 AM Maximilian Michels <m...@apache.org> wrote:
>>>>>>>
>>>>>>>> Hi @all,
>>>>>>>>
>>>>>>>> From a quick debugging session, I conclude that the wiring is in
>>>>>>>> place for the Flink Runner. There is a ProgressReporter that
>>>>>>>> reports MonitoringInfos to Flink, in a similar fashion as the
>>>>>>>> "legacy" Runner.
>>>>>>>>
>>>>>>>> The bundle duration metrics are 0, but the element count gets
>>>>>>>> reported correctly. It appears to be an issue in the Python/Java
>>>>>>>> harness, because "ProcessBundleProgressResponse" contains only 0
>>>>>>>> values for the bundle duration.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Max
>>>>>>>>
>>>>>>>> On 04.04.19 19:54, Mikhail Gryzykhin wrote:
>>>>>>>> > Hi everyone,
>>>>>>>> >
>>>>>>>> > Quick summary on Python and the Dataflow Runner:
>>>>>>>> > The Python SDK already reports:
>>>>>>>> > - MSec
>>>>>>>> > - User metrics (int64 and distribution)
>>>>>>>> > - PCollection element count
>>>>>>>> > - Work on MeanByteCount for PCollections is ongoing here
>>>>>>>> > <https://github.com/apache/beam/pull/8062>.
>>>>>>>> >
>>>>>>>> > Dataflow Runner:
>>>>>>>> > - All metrics listed above are passed through to Dataflow.
>>>>>>>> >
>>>>>>>> > Ryan can give more information on the Flink Runner. I also see
>>>>>>>> > Maximilian on some of the relevant PRs, so he might comment on
>>>>>>>> > this as well.
>>>>>>>> >
>>>>>>>> > Regards,
>>>>>>>> > Mikhail.
>>>>>>>> >
>>>>>>>> > On Thu, Apr 4, 2019 at 10:43 AM Pablo Estrada <pabl...@google.com> wrote:
>>>>>>>> >
>>>>>>>> > Hello guys!
>>>>>>>> > Alex, Mikhail and Ryan are working on support for metrics in the
>>>>>>>> > portability framework. The support on the SDK is pretty advanced
>>>>>>>> > AFAIK*, and the next step is to get the metrics back into the
>>>>>>>> > runner.
>>>>>>>> > Lukasz and myself are working on a project that depends on
>>>>>>>> > this too, so I'm adding everyone so we can get an idea of
>>>>>>>> > what's missing.
>>>>>>>> >
>>>>>>>> > I believe:
>>>>>>>> > - User metrics are fully wired up in the SDK.
>>>>>>>> > - State sampler (timing) metrics are wired up as well (is that
>>>>>>>> > right, +Alex Amato <mailto:ajam...@google.com>?)
>>>>>>>> > - Work is ongoing to send the updates back to Flink.
>>>>>>>> > - What is the plan for making metrics queryable from Flink?
>>>>>>>> > +Ryan Williams <mailto:r...@runsascoded.com>
>>>>>>>> >
>>>>>>>> > Thanks!
>>>>>>>> > -P.
>>>>>>>> >
>>>>>>>> > On Wed, Apr 3, 2019 at 12:02 PM Thomas Weise <t...@apache.org> wrote:
>>>>>>>> >
>>>>>>>> > I believe this is where the metrics are supplied:
>>>>>>>> > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/worker/operations.py
>>>>>>>> >
>>>>>>>> > git grep process_bundle_msecs yields results for the Dataflow
>>>>>>>> > worker only.
>>>>>>>> >
>>>>>>>> > There isn't any test coverage for the Flink runner:
>>>>>>>> > https://github.com/apache/beam/blob/d38645ae8758d834c3e819b715a66dd82c78f6d4/sdks/python/apache_beam/runners/portability/flink_runner_test.py#L181
>>>>>>>> >
>>>>>>>> > On Wed, Apr 3, 2019 at 10:45 AM Akshay Balwally <abalwa...@lyft.com> wrote:
>>>>>>>> >
>>>>>>>> > Should have added - I'm using the Python SDK, Flink runner.
>>>>>>>> >
>>>>>>>> > On Wed, Apr 3, 2019 at 10:32 AM Akshay Balwally <abalwa...@lyft.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> > I'm hoping to get metrics on the amount of time spent on
>>>>>>>> > each operator, so it seems like the stat
>>>>>>>> > {organization_specific_prefix}.operator.beam-metric-pardo_execution_time-process_bundle_msecs-v1.gauge.mean
>>>>>>>> >
>>>>>>>> > would be pretty helpful. But in practice, this stat always
>>>>>>>> > shows 0, which I interpret as 0 milliseconds spent per bundle,
>>>>>>>> > which can't be correct (other stats show that the operators
>>>>>>>> > are running, and timers within the operators show more
>>>>>>>> > reasonable times). Is this a known bug?
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > *Akshay Balwally*
>>>>>>>> > Software Engineer
>>>>>>>> > 937.271.6469 <tel:+19372716469>
>>>>>>>> > Lyft <http://www.lyft.com/>
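For reference, the locking trade-off described at the top of the thread can be sketched roughly as follows (hypothetical, simplified names; the actual implementation is the Cython statesampler_fast linked above). The sampling thread writes msec counters, so it must take a lock; a status-reporting thread that only reads could skip the lock and tolerate slightly stale values.

```python
import threading
import time


class SampledCounters:
    """Hypothetical sketch: a background thread periodically charges
    elapsed time to whatever state is current. Because this thread
    WRITES the counters, it must hold a lock; pure readers would not
    need one. Not the actual Beam code."""

    def __init__(self, sampling_period_s=0.2):  # ~200ms, as in the thread
        self._lock = threading.Lock()
        self._msecs_by_state = {}
        self.current_state = 'start'
        self._period = sampling_period_s
        self._running = False

    def _sample_loop(self):
        while self._running:
            time.sleep(self._period)
            with self._lock:  # writer side: the lock is necessary
                state = self.current_state
                self._msecs_by_state[state] = (
                    self._msecs_by_state.get(state, 0)
                    + int(self._period * 1000))

    def start(self):
        self._running = True
        self._thread = threading.Thread(target=self._sample_loop, daemon=True)
        self._thread.start()

    def stop(self):
        self._running = False
        self._thread.join()

    def snapshot(self):
        # Reader side: taken here for a consistent view, but a
        # status-reporting thread could read without it and accept
        # a few inconsistencies, as the thread above discusses.
        with self._lock:
            return dict(self._msecs_by_state)
```

In the pure-Python path this thread simply does not exist, so neither the per-state-change lock cost nor the msec counters do, which is why the metric reads 0 without Cython.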