Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

Robert Bradshaw Fri, 13 Apr 2018 17:19:59 -0700

On Fri, Apr 13, 2018 at 4:30 PM Alex Amato <ajam...@google.com> wrote:


> There are a few more confusing concepts in this thread
> *Name*
>
>    - Name can mean a *"string name"* used to refer to a metric in a
>    metrics system such as stackdriver, i.e. "ElementCount", "ExecutionTime"
>    - Name can mean a set of *context* fields added to a counter, either
>    embedded in a complex string, or in a structured name. Typically referring
>    to *aggregation entities, *which define how the metric updates get
>    aggregated into final metric values, i.e. all Metric updates with the same
>    field are aggregated together.
>       - e.g.my_ptransform_id-ElementCount
>       - e.g.{ name : 'ElementCount', 'ptransform_name' :
>       'my_ptransform_id' }
>    - The *URN* of a Metric, which identifies a proto to use in a payload
>    field for the Metric and MetricSpec. Note: The string name, can literally
>    be the URN value in most cases, except for metrics which can specify a
>    separate name (i.e. user counters).
>
> @Robert,
> You have proposed that metrics should contain the following parts, I still
> don't fully understand what you mean by each one.
>
>    - Name - Why is a name a URN + bytes payload? What type of name are
>    you referring to, *string name*? *context*? *URN*? Or something else.
>
> As you say above, the URN can literally be the string name. I see no
reason why this can't be the case for user counters as well (the user
counter name becoming part of the urn). The payload, should we decide to
keep it, is "part" of the name because it helps identify what exactly we're
counting. I.e. {urnX, payload1} would be distinct from {urnX, payload2}.
The only reason to have a payload is to avoid sticking stuff that would be
ugly to parse into the URN.

>
>    - Entity - This is how the metric is aggregated together. If I
>    understand you correctly. And you correctly point out that a singular
>    entity is not sufficient, a set of labels may be more appropriate.
>
> Alternatively, the entity/labels specifies possible sub-partitions of the
metric identified by its name (as above).

>
>    - Value - *Are you saying this is just the metric value, not including
>    any fields related to entity or name.*
>
> Exactly. Like "5077." For some types it would be composite. The type also
indicates how it's encoded (e.g. as bytes, or which field of a oneof should
be populated).

>
>    - Type - I am not clear at all on what this is or what it would look
>    like. Are you referring to units, like milliseconds/seconds? Why it
>    wouldn't be part of the value payload. Is this some sort of reason to
>    separate it out from the value? What if the value has multiple fields for
>    example.
>
> Type would be "beam:metric_type:sum:ints" or
"beam:metric_type:distribution:doubles." We could separate "data type" from
"aggregation type" if desired, though of course the full cross-product
doesn't makes sense. We could put the unit in the type (e.g. sum_durations
!= sum_ints), but, preferably, I'd put this as metadata on the counter
spec. It is often fully determined by the URN, but provided so one can
reason about the metric without having to interpret the URN. It also means
we don't have to have a separate URN for each user metric type. (In fact,
any metric the runner doesn't understand would be treated as a user metric,
and aggregated as such if it understand the type.)

Some pros and cons as I see them
> Pros:
>
>    - More separation and flexibility for an SDK to specify labels
>    separately from the value/type. Though, maybe I don't understand enough,
>    and I am not so sure this is a con over just having the URN payload contain
>    everything in itself.
>
> We can't interpret a URN payload unless we know the URN. Separating things
out allows us to act on metrics without interpreting the URN (both for
unknown URNs, and simplifying the logic by not having to do lookups on the
URN everywhere).


> Cons:
>
>    - I think this means that the SDK must properly pick two separate
>    payloads and populate them correctly. We can run into issues where.
>       - Having one URN which specifies all the fields you would need to
>       populate for a specific metric avoids this, this was a concern brought 
> up
>       by Luke. The runner would then be responsible for packaging metrics up 
> to
>       send to external monitoring systems.
>
> I'm not following you here. We'd return exactly what Andrea suggested.


>
> @Andrea, please correct me if I misunderstand
> Thank you for the metric spec example in your last response, I think that
> makes the idea much more clear.
>
> Using your approach I see the following pros and cons
> Pros:
>
>    - Runners have a cleaner more reusable codepath to forwarding metrics
>    to external monitoring systems. This will mean less work on the runner side
>    to support each metric (perhaps none in many cases).
>    - SDKs may need less code as well to package up new metrics.
>    - As long\ as we expect SDKs to only send cataloged/requested metrics,
>    we can avoid the issues of SDKs creating too many metrics, metrics the
>    runner/engine don't understand, etc.
>
> Cons:
>
>    - Luke's concern with this approach was that this spec ends up boiling
>    down to just the name, in this case "my_timer". His concern is that with
>    many SDK implementations, we can have bugs using the wrong string name for
>    counters, or populating them with the wrong values.
>       - Note how the ParDoExecution time example in the doc lets you
>       build the group of metrics together, rather than reporting three 
> different
>       ones from SDK->Runner. This sort of thing can make it more clear how to
>       fill in metrics in the SDK side. Then the RunnerHarness is responsible 
> for
>       packaging the metrics up for monitoring systems, not the SDK side.
>    - Ruling out URNs+payloads altogether (Though, I don't think you are
>    suggesting this) is less extensible for custom runners+sdks+engines. I.e.
>    the table of I/O files example. It also rules out sending parameters for a
>    metric from the runner->SDK.
>    - Populating each metric spec in code in each Runner could be
>    similarly error prone. Instead of just stating "urn:namespace:my_timer",
>    you must specify this and each runner must get it correct:
>    - {
>       name: "my_timer"
>       labels: { "ptransform" }
>       type: GAUGE
>       value_type: int64
>       units: SECONDS
>       description: "Times my stuff"
>       }
>       - Would the MetricSpec be passed like that from the RunnerHarness
>       to the SDK? This part I am not so clear on.
>       - Do we want runners to accept and forward metrics they don't know
>    about? That was another concern, was to not accept them until both the SDK
>    and Runner have been updated to accept them. Consider the performance
>    implications of an SDK sending a noisy metric.
>
> This being said, I think some of this can be mitigated.
>
>    1. Could a URN describe a metric spec in the format you describe. I.e.
>    urn:namespace:my_timer is passed to the SDK, and it knows to somehow find a
>    catalog which tells it about the MetricSpec. This way with just a URN it
>    knows how to populate a common representation, like the MetricSpec you have
>    specified.
>    2. This common metric representation could be a specific URN+payload
>    used for most metrics, while extensible ones use a different URN+payload.
>
>
> Thanks for the discussion on this thread today. I'd like us to keep
> engaging like this to come to an agreement.
> Alex
>
> On Fri, Apr 13, 2018 at 2:23 PM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> On Fri, Apr 13, 2018 at 1:32 PM Kenneth Knowles <k...@google.com> wrote:
>>
>>>
>>> Or just "beam:counter:<namespace>:<name>" or even
>>> "beam:metric:<namespace>:<name>" since metrics have a type separate from
>>> their name.
>>>
>>
>> I proposed keeping the "user" in there to avoid possible clashes with the
>> system namespaces. (No preference on counter vs. metric, I wasn't trying to
>> imply counter = SumInts)
>>
>>
>> On Fri, Apr 13, 2018 at 2:02 PM Andrea Foegler <foeg...@google.com>
>> wrote:
>>
>>> I like the generalization from entity -> labels.  I view the purpose of
>>> those fields to provide context.  And labels feel like they supports a
>>> richer set of contexts.
>>>
>>
>> If we think such a generalization provides value, I'm fine with doing
>> that now, as sets or key-value maps, if we have good enough examples to
>> justify this.
>>
>>
>>> The URN concept gets a little tricky.  I totally agree that the context
>>> fields should not be embedded in the name.
>>> There's a "name" which is the identifier that can be used to communicate
>>> what context values are supported / allowed for metrics with that name (for
>>> example, element_count expects a ptransform ID).  But then there's the
>>> context.  In Stackdriver, this context is a map of key-value pairs; the
>>> type is considered metadata associated with the name, but not communicated
>>> with the value.
>>>
>>
>> I'm not quite following you here. If context contains a ptransform id,
>> then it cannot be associated with a single name.
>>
>>
>>> Could the URN be "beam:namespace:name" and every metric have a map of
>>> key-value pairs for context?
>>>
>>
>> The URN is the name. Something like
>> "beam:metric:ptransform_execution_times:v1."
>>
>>
>>> Not sure where this fits in the discussion or if this is handled
>>> somewhere, but allowing for a metric configuration that's provided
>>> independently of the value allows for configuring "type", "units", etc in a
>>> uniform way without having to encode them in the metric name / value.
>>> Stackdriver expects each metric type has been configured ahead of time with
>>> these annotations / metadata.  Then values are reported separately.  For
>>> system metrics, the definitions can be packaged with the SDK.  For user
>>> metrics, they'd be defined at runtime.
>>>
>>
>> This feels like the metrics spec, that specifies that the metric with
>> name/URN X has this type plus a bunch of other metadata (e.g. units, if
>> they're not implicit in the type? This gets into whether the type should be
>> Duration{Sum,Max,Distribution,...} vs. Int{Sum,Max,Distribution,...} +
>> units metadata).
>>
>

Re: Updated [Proposal] Apache Beam Fn API : Defining and adding SDK Metrics

Reply via email to