I think Luke's example is interesting, but I wonder how common it is or would be? I'd expect failures to happen, but not at a rate so dramatic that it would be interesting to track at the application level (I'd expect the runner/cluster to properly monitor the uptime of processes/nodes separately). And even if it is useful, I can't think of other use cases.
I thought the idea was to "declare" the Metrics guarantee level in the query API, but the more I think about it, the more I tend to let it go, for the following reasons:

- Setting aside Luke's example, I think users would prefer the best guarantee a runner can provide. And on that note, I'd expect a "getMetrics" API and not have to figure out guarantees.
- Programmatic querying would "break" portability (UnsupportedOperationException) if a program that was running on a runner that supports committed() tried to execute on a runner that only supports attempted() (see the first sketch after the quoted thread below). I know that portability is about the Pipeline and this is post-execution, but still - call it a 25% portability issue ;-).
- According to the Capability Matrix, all runners fail to provide a "commit" guarantee for Aggregators. I can only speak for Spark: supporting the Metrics API relies on the same underlying mechanism, so nothing will change there. I wonder about other runners - does anyone plan to support "commit" guarantees for Metrics soon? Having said that, I'm not sure this is a good reason not to have this as a placeholder.

Another question about querying Metrics: querying by step could be a bit tricky. A runner is expected to keep unique names/IDs for steps, but users are supposed to be aware of this, and I suspect they might not follow it - if they use the same ParDo in a couple of places and query for it, it might be confusing for them to see "double counts" they didn't intend (see the second sketch after the quoted thread below).

Amit.

On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers <[email protected]> wrote:

> Thanks for starting the discussion! I'm going to hold off saying what I think and instead just provide some background and additional questions, because I want to see where the discussion goes.
>
> When I first suggested the API for querying metrics, I was adding it for parity with aggregators. A good first question might be: does the pipeline result even need query methods? Runners could add them as necessary, based on the levels of querying they support.
>
> The other desire was to make the accuracy clear. One implementation path was reporting metrics directly from the workers while attempting work. This can overcount when retrying, and may be under the actual attempts if a worker lost connectivity before reporting.
>
> Another implementation was something like a side output, where the counts are committed as part of each bundle's results and then aggregated. This committed value is more accurate and represents the value that occurred along the success path of the pipeline.
>
> I suspect there are other possible implementations, so trying to make an API that expresses all of them is difficult. So:
>
> 1. Does pipeline result need to support querying (which is useful for programmatic consumption), or are metrics intended only to get values out of a pipeline and into some metrics store?
>
> 2. How should pipeline results indicate the different kinds of metrics? What if a runner supported multiple kinds (e.g., the runner reports both attempted and committed results)? As Luke mentions, it may be useful to look at both to understand how much retries affected the value.
>
> On Thu, Jan 19, 2017, 1:42 AM Aviem Zur <[email protected]> wrote:
>
> > Hi all,
> >
> > While working on the implementation of the metrics API in the Spark runner, the question of committed vs. attempted results has come up, sparking (no pun intended) an interesting conversation.
> > (See the API, MetricResult
> > <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricResult.java>,
> > and the discussion in PR #1750 <https://github.com/apache/beam/pull/1750>.)
> >
> > The separation of `attempted` and `committed` metric results seems a bit unclear.
> >
> > Seeing that current implementations of aggregators in the different runners do not guarantee correctness, one could assume that the metrics API implementations will follow the same guarantees.
> >
> > If this is correct, then you could assume that only `attempted()` metric results can be fulfilled.
> > Would it then be better to just have a single method, such as `get()`, in the API, and have the guarantees of each runner explained in the capability matrix / documentation?
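
[Editor's note: to make the API shape under discussion concrete, here is a minimal Java sketch of programmatic metric querying. It assumes the MetricResult interface linked above, with attempted() and committed() accessors, plus a queryMetrics(MetricsFilter) entry point reachable via PipelineResult.metrics(); exact names may differ across Beam versions. The catch block illustrates the portability concern raised in the reply: code written against committed() fails on a runner that only supports attempted().]

```java
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.MetricNameFilter;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.metrics.MetricsFilter;
import org.apache.beam.sdk.transforms.DoFn;

public class MetricsQueryExample {

  // A DoFn that counts elements via the Metrics API.
  static class CountFn extends DoFn<String, String> {
    private final Counter elements = Metrics.counter(CountFn.class, "elements");

    @ProcessElement
    public void processElement(ProcessContext c) {
      elements.inc();
      c.output(c.element());
    }
  }

  static void report(PipelineResult result) {
    // Query all counters named "elements" declared in CountFn.
    MetricQueryResults metrics =
        result.metrics().queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.named(CountFn.class, "elements"))
                .build());

    for (MetricResult<Long> counter : metrics.counters()) {
      // attempted() includes work from failed/retried bundles, so it may overcount.
      System.out.println(counter.step() + " attempted: " + counter.attempted());
      try {
        // committed() reflects only successfully committed bundles; a runner
        // that cannot provide that guarantee may throw instead.
        System.out.println(counter.step() + " committed: " + counter.committed());
      } catch (UnsupportedOperationException e) {
        // The portability gap discussed above: the same program behaves
        // differently depending on the guarantee the runner supports.
        System.out.println(counter.step() + " committed: <not supported by this runner>");
      }
    }
  }
}
```

[Editor's note: and a short continuation illustrating the "double counts" concern: applying the same DoFn at two points in a pipeline yields one MetricResult per step for a name-only query, so naively summing the results conflates the two applications. This reuses CountFn and the imports from the sketch above; the step names "FirstUse" and "SecondUse" are hypothetical, and addStep is assumed to restrict the query by step name.]

```java
// The same CountFn applied at two steps (step names are hypothetical).
PCollection<String> lines = pipeline.apply(Create.of("a", "b", "c"));
lines.apply("FirstUse", ParDo.of(new CountFn()));
lines.apply("SecondUse", ParDo.of(new CountFn()));

PipelineResult result = pipeline.run();
result.waitUntilFinish();

// A name-only filter matches both applications: one MetricResult per step.
MetricQueryResults byName =
    result.metrics().queryMetrics(
        MetricsFilter.builder()
            .addNameFilter(MetricNameFilter.named(CountFn.class, "elements"))
            .build());
for (MetricResult<Long> c : byName.counters()) {
  // Summing across results without inspecting step() "double counts"
  // relative to what a user thinking of CountFn as one counter expects.
  System.out.println(c.step() + ": " + c.attempted());
}

// Restricting the query to a single step removes the ambiguity.
MetricQueryResults firstOnly =
    result.metrics().queryMetrics(
        MetricsFilter.builder().addStep("FirstUse").build());
```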
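[Editor's note: one design observation on the two sketches: the try/catch pattern in the first one is exactly the runtime "guarantee discovery" Amit argues against; Aviem's single get() proposal would remove the exception path but push the attempted-vs-committed distinction entirely into the capability matrix and documentation.]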
