For the short term, it seems like staying with the existing Query API and
allowing runners to throw exceptions if a user issues a query that is not
supported is reasonable. It shouldn't affect the ability to run a Beam
pipeline on other runners, since the Query API is only exposed *after* the
pipeline is run.
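
As a rough sketch of what that could look like on the runner side -- assuming
the MetricResult<T> shape from the PR linked below, with the class name here
(AttemptedOnlyResult) purely illustrative:

    // Assumes org.apache.beam.sdk.metrics.{MetricName, MetricResult}.
    // A runner that only tracks attempted values can still implement the
    // interface by throwing from committed().
    class AttemptedOnlyResult implements MetricResult<Long> {
      private final MetricName name;
      private final String step;
      private final Long attempted;

      AttemptedOnlyResult(MetricName name, String step, Long attempted) {
        this.name = name;
        this.step = step;
        this.attempted = attempted;
      }

      @Override public MetricName name() { return name; }
      @Override public String step() { return step; }
      @Override public Long attempted() { return attempted; }

      @Override public Long committed() {
        // Not supported by this runner; callers get a clear, catchable signal.
        throw new UnsupportedOperationException(
            "This runner only supports attempted metric results.");
      }
    }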

For the longer term, it seems like the Query API could merit some more
thought, especially if people have good use cases for accessing the value
of metrics programmatically from the same program that ran the original
pipeline.

Aviem -- is there anything specific that needs to be discussed/nailed down
to help with your PR?

On Thu, Jan 19, 2017 at 3:57 PM Ben Chambers <bchamb...@google.com> wrote:

> On Thu, Jan 19, 2017 at 3:28 PM Amit Sela <amitsel...@gmail.com> wrote:
>
> On Fri, Jan 20, 2017 at 1:18 AM Ben Chambers <bchamb...@google.com.invalid>
> wrote:
>
> > Part of the problem here is that whether attempted or committed is "what
> > matters" depends on the metric. If you're counting the number of RPCs to
> > an external service, you may want all the attempts (to predict your
> > bill). If you're tracking how long those RPCs took, you may want just the
> > committed value (e.g., the best-case time to execute your pipeline). This
> > is essentially Luke's case of wanting one or the other.
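> >
> > For concreteness, a sketch of the kind of DoFn I mean
> > (callExternalService is a hypothetical stand-in for a real RPC):
> >
> >     import org.apache.beam.sdk.metrics.Counter;
> >     import org.apache.beam.sdk.metrics.Metrics;
> >     import org.apache.beam.sdk.transforms.DoFn;
> >
> >     class CallingDoFn extends DoFn<String, String> {
> >       // Incremented on every attempt, including retried bundles, so the
> >       // attempted value predicts the bill; the committed value reflects
> >       // only bundles that succeeded.
> >       private final Counter rpcs = Metrics.counter(CallingDoFn.class, "rpcs");
> >
> >       @ProcessElement
> >       public void processElement(ProcessContext c) {
> >         rpcs.inc();
> >         c.output(callExternalService(c.element()));
> >       }
> >
> >       private String callExternalService(String request) {
> >         return request;  // stand-in for the external call
> >       }
> >     }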
> >
This sounds like Metrics should have a user-declared guarantee level, which
might make more sense - Metrics.counter().attempted()/committed() - though
this might prove more challenging for runners to implement.
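
Something like this hypothetical shape - to be clear, nothing below exists
today, and MyDoFn is illustrative; it just shows the declaration-time idea:

    // Hypothetical: the guarantee level is declared when the metric is
    // created, and a runner rejects levels it cannot provide.
    Counter attempts = Metrics.counter(MyDoFn.class, "rpcs").attempted();
    Counter successes = Metrics.counter(MyDoFn.class, "rpcs").committed();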
>
>
> We went away from that because of cases like Luke's, where a user might
> want to compare the two. Or a user might not even realize there is a
> difference up front, so declaring ahead of time is difficult. If both are
> available and can be looked at, and they're the same -- no problem. If
> they're different, then that's a good reason to investigate and figure out
> the difference.
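>
> E.g., assuming a MetricResult that exposes both values, the comparison is
> trivial for the user to do after the fact:
>
>     // Sketch: how much did retries inflate a counter?
>     long retryInflation(MetricResult<Long> rpcs) {
>       // 0 means no visible difference; anything else is worth a look.
>       return rpcs.attempted() - rpcs.committed();
>     }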
>
>
> >
> > Regarding the step names -- the metrics are reported using the full step
> > name, which is also made unique during the Graph API. So "My
> > Composite/ParDo(MyDoFn)" or "My Composite/ParDo(MyDoFn)2" if there are
> > multiple instances within the same composite. Specifically -- the names
> > are made unique prior to recording metrics, so there are no double counts.
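> >
> > For example (illustrative only -- MyDoFn is a placeholder), applying the
> > same DoFn twice yields distinct full names, so their metrics stay
> > separate:
> >
> >     PCollection<String> first = input.apply(ParDo.of(new MyDoFn()));
> >     // Reported under a unique name such as "ParDo(MyDoFn)2", not merged
> >     // with the first application above.
> >     PCollection<String> second = first.apply(ParDo.of(new MyDoFn()));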
> >
> But how would the user know that? I'm afraid this could be confusing as a
> user-facing query API, and I think most users would simply name metrics
> differently.
>
>
> The query API can support querying based on a sub-string of the full name,
> and return metrics from all steps that match. That would allow the user to
> query metrics without knowing the exact unique name. Having unique names
> for steps is important and useful for many other things (logging,
> associating time spent executing, etc.).
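>
> A sketch of the caller side, assuming a MetricsFilter that matches steps on
> a sub-string of the full name (pipelineResult is the PipelineResult from
> run()):
>
>     MetricQueryResults results = pipelineResult.metrics().queryMetrics(
>         MetricsFilter.builder()
>             .addStep("ParDo(MyDoFn)")  // matches every step containing this
>             .build());
>     for (MetricResult<Long> counter : results.counters()) {
>       System.out.println(counter.step() + ": " + counter.attempted());
>     }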
>
>
> >
> > On Thu, Jan 19, 2017 at 1:57 PM Amit Sela <amitsel...@gmail.com> wrote:
> >
> > > I think Luke's example is interesting, but I wonder how common it is or
> > > would be? I'd expect failures to happen, but not at a rate so dramatic
> > > that it would be interesting to follow applicatively (I'd expect the
> > > runner/cluster to properly monitor the uptime of processes/nodes
> > > separately). And even if it is useful, I can't think of other use cases.
> > >
> > > I thought the idea was to "declare" the Metrics guarantee level in the
> > > query API, but the more I think about it the more I tend to let it go
> > > for the following reasons:
> > >
> > >    - Setting aside Luke's example, I think users would prefer the best
> > >    guarantee a runner can provide. And on that note, I'd expect a
> > >    "getMetrics" API and not have to figure out guarantees.
> > >    - Programmatic querying would "break" (UnsupportedOperationException)
> > >    portability if a program that ran on a runner that supports
> > >    committed() were executed on a runner that only supports attempted()
> > >    (see the sketch after this list).
> > >    - I know that portability is for the Pipeline and this is
> > >    post-execution, but still, call it a 25% portability issue ;-) .
> > >    - According to the Capability Matrix, all runners fail to provide a
> > >    "commit" guarantee for Aggregators. I can only speak for Spark,
> > >    saying that supporting the Metrics API relies on the same underlying
> > >    mechanism, so nothing will change. I wonder about other runners --
> > >    does anyone plan to support "commit" guarantees for Metrics soon?
> > >    Having said that, I'm not sure this is a good reason not to have
> > >    this as a placeholder.
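> > >
> > > (The sketch mentioned above -- what a "portable" caller has to write if
> > > committed() may throw; result here is an already-queried
> > > MetricResult<Long>:)
> > >
> > >    Long value;
> > >    try {
> > >      value = result.committed();
> > >    } catch (UnsupportedOperationException e) {
> > >      // Fall back to the weaker guarantee on runners without committed().
> > >      value = result.attempted();
> > >    }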
> > >
> > > Another question for querying Metrics - querying by step could be a bit
> > > tricky, since a runner is expected to keep unique names/ids for steps,
> > > and users are supposed to be aware of this. I suspect many users won't
> > > be, and if they use the same ParDo in a couple of places and query it,
> > > it might be confusing for them to see "double counts" they didn't
> > > intend.
> > >
> > > Amit.
> > >
> > > On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers
> > > <bchamb...@google.com.invalid> wrote:
> > >
> > > > Thanks for starting the discussion! I'm going to hold off saying what
> > > > I think and instead just provide some background and additional
> > > > questions, because I want to see where the discussion goes.
> > > >
> > > > When I first suggested the API for querying metrics, I was adding it
> > > > for parity with aggregators. A good first question might be: does the
> > > > pipeline result even need query methods? Runners could add them as
> > > > necessary, based on the levels of querying they support.
> > > >
> > > > The other desire was to make the accuracy clear. One implementation
> > > > path is reporting metrics directly from the workers while attempting
> > > > work. This can overcount when work is retried, and may undercount the
> > > > actual attempts if a worker loses connectivity before reporting.
> > > >
> > > > Another implementation is something like a side output, where the
> > > > counts are committed as part of each bundle's results and then
> > > > aggregated. This committed value is more accurate and represents the
> > > > value that occurred along the success path of the pipeline.
> > > >
> > > > I suspect there are other possible implementations, so trying to make
> > > > an API that expresses all of them is difficult. So:
> > > >
> > > > 1. Does the pipeline result need to support querying (which is useful
> > > > for programmatic consumption), or are metrics intended only to get
> > > > values out of a pipeline and into some metrics store?
> > > >
> > > > 2. How should pipeline results indicate the different kinds of
> > > > metrics? What if a runner supports multiple kinds (e.g., the runner
> > > > reports both attempted and committed results)? As Luke mentions, it
> > > > may be useful to look at both to understand how much retries affected
> > > > the value.
> > > > On Thu, Jan 19, 2017, 1:42 AM Aviem Zur <aviem...@gmail.com> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > While working on the implementation of the metrics API in the Spark
> > > > runner, the question of committed vs. attempted results has come up,
> > > > sparking (no pun intended) an interesting conversation.
> > > > (See API: MetricResult
> > > > <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricResult.java>
> > > > and discussion: PR #1750 <https://github.com/apache/beam/pull/1750>)
> > > >
> > > > The separation of `attempted` and `committed` metric results seems a
> > > > bit unclear.
> > > >
> > > > Seeing that current implementations of aggregators in the different
> > > > runners do not guarantee correctness, one could assume that the
> > > > metrics API implementations will provide the same guarantees.
> > > >
> > > > If this is correct, then you could assume that only `attempted()`
> > > > metric results can be fulfilled.
> > > > Would it then be better to just have a single method such as `get()`
> > > > in the API, and have the guarantees of each runner explained in the
> > > > capability matrix / documentation?
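> > > >
> > > > I.e., instead of the attempted()/committed() pair in MetricResult.java
> > > > above, something like this (hypothetical):
> > > >
> > > >     public interface MetricResult<T> {
> > > >       MetricName name();
> > > >       String step();
> > > >       // The best guarantee the runner provides; which one that is
> > > >       // would be documented in the capability matrix.
> > > >       T get();
> > > >     }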
> > > >
> > >
> >
>
>
