I think Luke's example is interesting, but I wonder how common it is or
would be. I'd expect failures to happen, but not at a rate so dramatic that
it would be worth tracking at the application level (I'd expect the
runner/cluster to properly monitor the uptime of processes/nodes
separately). And even if that use case is valuable, I can't think of other
ones.

I thought the idea was to "declare" the Metrics guarantee level in the
query API, but the more I think about it, the more I'm inclined to let that
go, for the following reasons:

   - Setting aside Luke's example, I think users would simply prefer the
   best guarantee a runner can provide. On that note, I'd expect a
   "getMetrics" API rather than having to figure out guarantee levels.
   - Programmatic querying would "break" portability (with an
   UnsupportedOperationException) if a program that ran on a runner
   supporting committed() were executed on a runner that only supports
   attempted(); see the sketch after this list. I know portability is about
   the Pipeline and this is post-execution, but still, call it a 25%
   portability issue ;-).
   - According to the Capability Matrix, no runner currently provides a
   "committed" guarantee for Aggregators. I can only speak for Spark:
   supporting the Metrics API relies on the same underlying mechanism, so
   nothing will change there. I wonder about other runners; does anyone
   plan to support "committed" guarantees for Metrics soon? Having said
   that, I'm not sure this is a good reason not to keep this as a
   placeholder.
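
To make the portability point concrete, here is a rough sketch of what
programmatic querying might look like (assuming the MetricsFilter /
queryMetrics shape from PR #1750; names may not match the final API):

    PipelineResult result = pipeline.run();
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named("myNamespace", "myCounter"))
            .build());
    for (MetricResult<Long> counter : metrics.counters()) {
      try {
        // Works on a runner that supports committed()...
        System.out.println(counter.name() + ": " + counter.committed());
      } catch (UnsupportedOperationException e) {
        // ...but throws on a runner that only supports attempted(), so
        // the same post-execution code is no longer portable.
        System.out.println(counter.name() + ": " + counter.attempted());
      }
    }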

Another question about querying Metrics: querying by step could be a bit
tricky. A runner is expected to keep unique names/ids for steps, but here
users are expected to be aware of those names, and I suspect they won't
always be. If they use the same ParDo in a couple of places and then query
for it, it might be confusing for them to see "double counts" they didn't
intend.
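
For example, something like the following (sketch only; CountingFn and the
step names are hypothetical) reports the counter separately per application
of the DoFn, which may surprise a user who thinks of it as a single counter:

    static class CountingFn extends DoFn<String, String> {
      private final Counter elements = Metrics.counter("example", "elements");

      @ProcessElement
      public void processElement(ProcessContext c) {
        elements.inc();
        c.output(c.element());
      }
    }

    // The same DoFn applied at two steps:
    input.apply("StepA", ParDo.of(new CountingFn()));
    input.apply("StepB", ParDo.of(new CountingFn()));

    // A name-only query matches both applications, so the user sees two
    // results, or a "doubled" total if they sum them without meaning to:
    MetricQueryResults metrics = result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named("example", "elements"))
            .build());
    for (MetricResult<Long> counter : metrics.counters()) {
      System.out.println(counter.step() + ": " + counter.attempted());
    }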

Amit.

On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers <[email protected]>
wrote:

> Thanks for starting the discussion! I'm going to hold off saying what I
> think and instead just provide some background and additional questions,
> because I want to see where the discussion goes.
>
> When I first suggested the API for querying metrics I was adding it for
> parity with aggregators. A good first question might be: does the pipeline
> result even need query methods? Runners could add them as necessary, based
> on the levels of querying they support.
>
> The other desire was to make the accuracy guarantees clear. One
> implementation path is reporting metrics directly from the workers while
> work is being attempted. This can overcount when work is retried, and can
> undercount the actual attempts if a worker loses connectivity before
> reporting.
>
> Another implementation is something like a side output, where the counts
> are committed as part of each bundle's results and then aggregated. This
> committed value is more accurate and represents the value that occurred
> along the success path of the pipeline.
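>
> As a toy illustration of the difference (my own sketch; Bundle and Attempt
> are hypothetical types, not actual runner internals):
>
>     record Attempt(long elements, boolean succeeded) {}
>     record Bundle(java.util.List<Attempt> attempts) {}
>
>     java.util.List<Bundle> bundles = java.util.List.of();  // bundles from the run
>     long attempted = 0;
>     long committed = 0;
>     for (Bundle b : bundles) {
>       for (Attempt a : b.attempts()) {
>         attempted += a.elements();  // every attempt adds, so retries double-count
>       }
>       Attempt last = b.attempts().get(b.attempts().size() - 1);
>       if (last.succeeded()) {
>         committed += last.elements();  // only the committed attempt counts
>       }
>     }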
>
> I suspect there are other possible implementations, so trying to make an
> API that expresses all of them is difficult. So:
>
> 1. Does the pipeline result need to support querying (which is useful for
> programmatic consumption), or are metrics intended only to get values out
> of a pipeline and into some metrics store?
>
> 2. How should pipeline results indicate the different kinds of metrics?
> What if a runner supported multiple kinds (e.g., the runner reports both
> attempted and committed results)? As Luke mentions, it may be useful to
> look at both to understand how much retries affected the value.
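>
> For question 2, one option (a sketch along the lines of the MetricResult
> interface linked below, not a final proposal) is to expose both levels and
> let a runner throw for a level it cannot provide:
>
>     public interface MetricResult<T> {
>       MetricName name();
>       String step();
>       // Aggregated over all attempts, including retries.
>       T attempted();
>       // Aggregated only over successfully committed bundles; may throw
>       // UnsupportedOperationException on runners that cannot provide it.
>       T committed();
>     }
>
> A runner that supports both would let users compare the two values to see
> how much retries affected the result.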
> On Thu, Jan 19, 2017, 1:42 AM Aviem Zur <[email protected]> wrote:
>
> Hi all,
>
> While working on the implementation of the metrics API in the Spark
> runner, the question of committed vs. attempted results has come up,
> sparking (no pun intended) an interesting conversation.
> (See API: MetricResult
> <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricResult.java>
> and discussion: PR #1750 <https://github.com/apache/beam/pull/1750>)
>
> The separation of `attempted` and `committed` metric results seems a bit
> unclear.
>
> Seeing that the current implementations of aggregators in the different
> runners do not guarantee correctness, one could assume that the metrics
> API implementations will provide the same (lack of) guarantees.
>
> If this is correct, then one could assume that only `attempted()` metric
> results can be fulfilled.
> Would it then be better to just have a single method such as `get()` in the
> API, and have the guarantees of each runner explained in the capability
> matrix / documentation?
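>
> The single-method alternative would look roughly like this (hypothetical
> sketch, not the API currently in MetricResult.java):
>
>     public interface MetricResult<T> {
>       MetricName name();
>       String step();
>       // The best value the runner can provide; whether that is an
>       // attempted or a committed value is documented per runner in the
>       // capability matrix rather than encoded in the API.
>       T get();
>     }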
>
