On Fri, Jan 20, 2017 at 1:18 AM Ben Chambers <bchamb...@google.com.invalid> wrote:
> Part of the problem here is that whether attempted or committed is "what
> matters" depends on the metric. If you're counting the number of RPCs to an
> external service, you may want all the attempts (to predict your bill). If
> you're tracking how long those RPCs took, you may want just the committed
> ones (e.g., what is the best-case time to execute your pipeline). This is
> essentially Luke's case of wanting one or the other.

This sounds like Metrics should have a user-defined guarantee level, which
might make more sense - Metrics.counter().attempted()/committed() - though
this might prove more challenging for runners to implement.

> Regarding the step names -- the metrics are reported using the full step
> name, which is also made unique during the Graph API. So "My
> Composite/ParDo(MyDoFn)" or "My Composite/ParDo(MyDoFn)2" if there are
> multiple instances within the same composite. Specifically -- the names are
> made unique prior to recording metrics, so there are no double counts.

But how would the user know that? I'm afraid this could be confusing as a
user-facing query API, and I think most users would simply name metrics
differently.

> On Thu, Jan 19, 2017 at 1:57 PM Amit Sela <amitsel...@gmail.com> wrote:
>
> > I think Luke's example is interesting, but I wonder how common it
> > is/would be? I'd expect failures to happen, but not at a rate so dramatic
> > that it would be interesting to follow applicatively (I'd expect the
> > runner/cluster to properly monitor uptime of processes/nodes separately).
> > And even if it is useful, I can't think of other use cases.
> >
> > I thought the idea was to "declare" the Metrics guarantee level in the
> > query API, but the more I think about it the more I tend to let it go,
> > for the following reasons:
> >
> > - Setting aside Luke's example, I think users would prefer the best
> > guarantee a runner can provide. And on that note, I'd expect a
> > "getMetrics" API and not have to figure out guarantees.
> > - Programmatic querying would "break" (UnsupportedOperationException)
> > portability if a program that was running on a runner that supports
> > committed() tried to execute on a runner that only supports attempted().
> > - I know that portability is for the Pipeline and this is post-execution,
> > but still, call it a 25% portability issue ;-).
> > - According to the Capability Matrix, all runners fail to provide a
> > "commit" guarantee for Aggregators. I can only speak for Spark, saying
> > that supporting the Metrics API relies on the same underlying mechanism,
> > and so nothing will change. I wonder about other runners - does anyone
> > plan to support "commit" guarantees for Metrics soon? Having said that,
> > I'm not sure this is a good reason not to have this as a placeholder.
> >
> > Another question for querying Metrics: querying by step could be a bit
> > tricky, since a runner is expected to keep unique naming/ids for steps,
> > but users are supposed to be aware of this here, and I'd suspect users
> > might not follow; if they use the same ParDo in a couple of places,
> > they'll query it and it might be confusing for them to see "double
> > counts" they didn't mean for.
> >
> > Amit.
> >
> > On Thu, Jan 19, 2017 at 7:36 PM Ben Chambers
> > <bchamb...@google.com.invalid> wrote:
> >
> > > Thanks for starting the discussion! I'm going to hold off saying what I
> > > think and instead just provide some background and additional
> > > questions, because I want to see where the discussion goes.
> > >
> > > When I first suggested the API for querying metrics I was adding it for
> > > parity with aggregators. A good first question might be: does the
> > > pipeline result even need query methods? Runners could add them as
> > > necessary based on the levels of querying they support.
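To make the attempted/committed distinction above concrete, here is a toy
model (illustrative names only, not the Beam API or any runner's code) of a
counter updated by workers that may retry a bundle: every attempt bumps the
attempted value, while only bundles whose results commit contribute to the
committed value, so attempted >= committed whenever retries happen.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model only -- not the Beam API. Shows why attempted >= committed when
// bundles are retried: every attempt increments `attempted` (RPC/bill-style
// accounting), but `committed` only reflects bundles that commit (success-path
// accounting).
public class RetryCounterDemo {
    static class ToyCounter {
        final AtomicLong attempted = new AtomicLong();
        final AtomicLong committed = new AtomicLong();

        // Simulate processing a bundle of `n` elements; `succeeds` says
        // whether the runner commits this bundle's results.
        void processBundle(long n, boolean succeeds) {
            attempted.addAndGet(n);      // counted on every attempt
            if (succeeds) {
                committed.addAndGet(n);  // counted only on commit
            }
        }
    }

    public static void main(String[] args) {
        ToyCounter c = new ToyCounter();
        c.processBundle(10, false);  // first attempt fails, work is retried
        c.processBundle(10, true);   // retry succeeds and commits
        System.out.println("attempted=" + c.attempted.get()
                + " committed=" + c.committed.get());
        // attempted=20 committed=10
    }
}
```

In this picture, counting RPCs for billing wants `attempted`, while measuring
best-case pipeline behavior wants `committed` -- which is exactly why a single
number cannot serve both questions.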
> > > The other desire was to make the accuracy clear. One implementation
> > > path was reporting metrics directly from the workers while attempting
> > > work. This can overcount when retrying, and may be under the actual
> > > attempts if the worker lost connectivity before reporting.
> > >
> > > Another implementation was something like a side output where the
> > > counts are committed as part of each bundle's results, and then
> > > aggregated. This committed value is more accurate and represents the
> > > value that occurred along the success path of the pipeline.
> > >
> > > I suspect there are other possible implementations, so trying to make
> > > an API that expresses all of them is difficult. So:
> > >
> > > 1. Does the pipeline result need to support querying (which is useful
> > > for programmatic consumption) or are metrics intended only to get
> > > values out of a pipeline and into some metrics store?
> > >
> > > 2. How should pipeline results indicate the different kinds of metrics?
> > > What if a runner supported multiple kinds (e.g., the runner reports
> > > both attempted and committed results)? As Luke mentions, it may be
> > > useful to look at both to understand how much retries affected the
> > > value.
> > >
> > > On Thu, Jan 19, 2017, 1:42 AM Aviem Zur <aviem...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > While working on the implementation of the metrics API in the Spark
> > > > runner, the question of committed vs. attempted results has come up,
> > > > sparking (no pun intended) an interesting conversation.
> > > > (See API: MetricResult
> > > > <https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/metrics/MetricResult.java>
> > > > and discussion: PR #1750 <https://github.com/apache/beam/pull/1750>)
> > > >
> > > > The separation of `attempted` and `committed` metric results seems a
> > > > bit unclear.
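The second implementation Ben describes -- counts committed alongside each
bundle's results, then aggregated -- could be sketched roughly like this (a
toy illustration with made-up names, not how any runner actually does it):
each bundle result carries its metric deltas, and only deltas from committed
bundles are folded into the totals, so a failed attempt contributes nothing.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of committed-metrics accounting (not runner code): each bundle
// result carries its metric deltas, and the aggregation folds in deltas only
// from bundles that committed, so failed attempts never reach the totals.
public class CommittedMetricsSketch {
    static class BundleResult {
        final Map<String, Long> deltas;  // counter name -> delta from this bundle
        final boolean committed;         // did this attempt commit?

        BundleResult(Map<String, Long> deltas, boolean committed) {
            this.deltas = deltas;
            this.committed = committed;
        }
    }

    static Map<String, Long> aggregateCommitted(List<BundleResult> results) {
        Map<String, Long> totals = new HashMap<>();
        for (BundleResult r : results) {
            if (!r.committed) {
                continue;  // drop deltas from failed attempts
            }
            r.deltas.forEach((name, d) -> totals.merge(name, d, Long::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<BundleResult> results = List.of(
                new BundleResult(Map.of("rpcs", 7L), false),  // failed attempt
                new BundleResult(Map.of("rpcs", 7L), true),   // successful retry
                new BundleResult(Map.of("rpcs", 3L), true));
        System.out.println(aggregateCommitted(results));  // {rpcs=10}
    }
}
```

A worker-side reporting scheme, by contrast, would have summed all three
bundles (17 rather than 10), which is the overcount-on-retry behavior of the
first implementation path.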
> > > > Seeing that current implementations of aggregators in the different
> > > > runners do not guarantee correctness, one could assume that the
> > > > metrics API implementations will also follow the same guarantees.
> > > >
> > > > If this is correct, then you could assume that only `attempted()`
> > > > metric results can be fulfilled.
> > > > Would it then be better to just have a single method such as `get()`
> > > > in the API, and have the guarantees of each runner explained in the
> > > > capability matrix / documentation?