Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Matthias J. Sax Tue, 03 Apr 2018 11:40:08 -0700

I am fine adding `TopologyTestDriver#metrics()`. It should be helpful to
test custom metrics users can implement.


-Matthias


On 4/3/18 11:35 AM, John Roesler wrote:
> Oh, sorry, I missed the point.
> 
> Yeah, we can totally do that. The reason to move it to the task level was
> mainly to make it available for the metrics in TopologyTestDriver as well.
> But if we decide that's a non-goal, then there's no motivation to change it.
> 
> And actually that reminds me that we do have an open question about whether
> I should add a metrics getter to the TopologyTestDriver's interface. WDYT?
> 
> Thanks,
> -John
> 
> On Tue, Apr 3, 2018 at 1:26 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> 
>> I think Matthias' comment is that, we can still record the metrics on the
>> thread-level, while having the WARN log entry to include sufficient context
>> information so that users can still easily narrow down the investigation
>> scope.
>>
>>
>> Guozhang
>>
>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:
>>
>>> I agree we should add as much information as is reasonable to the log.
>> For
>>> example, see this WIP PR I started for this KIP:
>>>
>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>> 88d129f048bc842c7db5b2566a45fce8R80
>>>
>>> and
>>>
>>> https://github.com/apache/kafka/pull/4812/files#diff-
>>> 69e6789eb675ec978a1abd24fed96eb1R111
>>>
>>> I'm not sure if we should nail down the log messages in the KIP or in the
>>> PR discussion. What say you?
>>>
>>> Thanks,
>>> -John
>>>
>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io>
>>> wrote:
>>>
>>>> Thanks for sharing your thoughts. As I mentioned originally, I am not
>>>> sure about the right log level either. Your arguments are convincing --
>>>> thus, I am fine with keeping WARN level.
>>>>
>>>> The task vs thread level argument is an interesting one. However, I am
>>>> wondering if we should add this information into the corresponding WARN
>>>> logs that we write anyway? For this case, we can also log the
>>>> corresponding operator (and other information like topic name etc if
>>>> needed). WDYT about this?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>> Regarding logging: I'm inclined to keep logging at WARN level since
>>>> skipped
>>>>> records are not expected in normal execution (for all reasons that we
>>> are
>>>>> aware of), and hence when error happens users should be alerted from
>>>>> metrics and looked into the log files, so to me if it is really
>>> spamming
>>>>> the log files it is also a good alert for users. Besides for
>>> deserialize
>>>>> errors we already log at WARN level for this reason.
>>>>>
>>>>> Regarding the metrics-levels: I was pondering on that as well. What
>>> made
>>>> me
>>>>> to think and agree on task-level than thread-level is that for some
>>>> reasons
>>>>> like window retention, they may possibly be happening on a subset of
>>>> input
>>>>> partitions, and tasks are correlated with partitions the task-level
>>>> metrics
>>>>> can help users to narrow down on the specific input data partitions.
>>>>>
>>>>>
>>>>> Guozhang
>>>>>
>>>>>
>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io>
>>> wrote:
>>>>>
>>>>>> Hi Matthias,
>>>>>>
>>>>>> No worries! Thanks for the reply.
>>>>>>
>>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to
>>>> write
>>>>>> a quick test exercising the current behavior and discovered that the
>>>>>> metrics weren't available. It seemed like they should be, so I
>> tacked
>>>> it on
>>>>>> to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>>
>>>>>> 2) I was also concerned about that, but I figured it would come up
>> in
>>>>>> discussion if I just went ahead and proposed it. And here we are!
>>>>>>
>>>>>> Here's my thought: maybe there are two classes of skips:
>> "controlled"
>>>> and
>>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>>> deliberately
>>>>>> filter out some events, and "uncontrolled" means that I simply don't
>>>>>> account for some feature of the data, and the framework skips them
>> (as
>>>>>> opposed to crashing).
>>>>>>
>>>>>> In this breakdowns, the skips I'm adding metrics for are all
>>>> uncontrolled
>>>>>> skips (and we hope to measure all the uncontrolled skips). Our skips
>>> are
>>>>>> well documented, so it wouldn't be terrible to have an application
>> in
>>>> which
>>>>>> you know you expect to have tons of uncontrolled skips, but it's not
>>>> great
>>>>>> either, since you may also have some *unexpected* uncontrolled
>> skips.
>>>> It'll
>>>>>> be difficult to notice, since you're probably not alerting on the
>>> metric
>>>>>> and filtering out the logs (whatever their level).
>>>>>>
>>>>>> I'd recommend any app author, as an alternative, to convert all
>>> expected
>>>>>> skips to controlled ones, by updating the topology to filter those
>>>> records
>>>>>> out.
>>>>>>
>>>>>> Following from my recommendation, as a library author, I'm inclined
>> to
>>>> mark
>>>>>> those logs WARN, since in my opinion, they should be concerning to
>> the
>>>> app
>>>>>> authors. I'd definitely want to show, rather than hide, them by
>>>> default, so
>>>>>> I would pick INFO at least.
>>>>>>
>>>>>> That said, logging is always a tricky issue for lower-level
>> libraries
>>>> that
>>>>>> run inside user code, since we don't have all the information we
>> need
>>> to
>>>>>> make the right call.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On your last note, yeah, I got that impression from Guozhang as
>> well.
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> -John
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>>> matth...@confluent.io>
>>>>>> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>
>>>>>>> I like your approach about "metrics are for monitoring, logs are
>> for
>>>>>>> debugging" -- however:
>>>>>>>
>>>>>>> 1) I don't see a connection between this and the task-level metrics
>>>> that
>>>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't
>> think
>>>>>>> people would monitor the `TopologyTestDriver` an thus wondering why
>>> it
>>>>>>> is important to include the metrics there? Thread-level metric
>> might
>>> be
>>>>>>> easier to monitor though (ie, less different metric to monitor).
>>>>>>>
>>>>>>> 2) I am a little worried about WARN level logging and that it might
>>> be
>>>>>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
>>>> level
>>>>>>> might be better. Not 100% sure about this to be honest. What is the
>>>>>>> general assumption about the frequency for skipped records? I could
>>>>>>> imagine cases for which skipped records are quite frequent and
>> thus,
>>>>>>> WARN level logs might "flood" the logs
>>>>>>>
>>>>>>> One final remark:
>>>>>>>
>>>>>>>> More
>>>>>>>> generally, I would like to establish a pattern in which we could
>> add
>>>>>> new
>>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>>
>>>>>>> From my understanding, this is not feasible. Changing metrics is
>>> always
>>>>>>> considered a public API change, and we need a KIP for any change.
>> As
>>> we
>>>>>>> moved away from tagging, it doesn't matter for the KIP anymore --
>>> just
>>>>>>> wanted to point it out.
>>>>>>>
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>>
>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>
>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>
>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
>>>> like
>>>>>>> to
>>>>>>>>> put it in the task instead of the thread so that it'll show up in
>>> the
>>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>>
>>>>>>>>> I'm leaning toward keeping the scope where it is right now, but
>> if
>>>>>>> others
>>>>>>>>> want to advocate for tossing in some more metrics, we can go that
>>>>>> route.
>>>>>>>>>
>>>>>>>>> Thanks all,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>
>>>>>>>>>> I'm on the fence with providing a single level metrics, but I
>>> think
>>>>>>> we'll
>>>>>>>>>> have that discussion outside of this KIP.
>>>>>>>>>>
>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>> per-task,
>>>>>>> or
>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>> * introduce a warn-level log detailing the
>> topic/partition/offset
>>>>>> and
>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>
>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>
>>>>>>>>>> Finally, we have had requests in the past for some metrics
>> around
>>>>>> when
>>>>>>>>>> persistent store removes an expired window.  Would adding that
>> to
>>>> our
>>>>>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>>>>>
>>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>>
>>>>>>>>>> Bill
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>>> wangg...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The proposal sounds good to me. About "maintain only one level
>> of
>>>>>>>>>> metrics"
>>>>>>>>>>> maybe we can discuss about that separately from this KIP since
>>> that
>>>>>>>>>> would
>>>>>>>>>>> be a larger scope of discussion. I agree that if we are going
>> to
>>>>>>>>>> maintain
>>>>>>>>>>> only one-level metrics it should be lowest level and we would
>> let
>>>>>>> users
>>>>>>>>>> to
>>>>>>>>>>> do the roll-ups themselves, but I'm still not fully convinced
>>> that
>>>>>> we
>>>>>>>>>>> should just provide single-level metrics, because 1) I think
>> for
>>>>>>>>>> different
>>>>>>>>>>> metrics people may be interested to investigate into different
>>>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
>>> lowest
>>>>>>>>>>> task-level metrics, while for process-rate / skip-rate they can
>>> be
>>>>>> as
>>>>>>>>>> low
>>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups may not
>>> be
>>>>>>> very
>>>>>>>>>>> straight-forward. But for 2) if someone can provide an
>> efficient
>>>> and
>>>>>>>>>> easy
>>>>>>>>>>> implementation of that I can be persuaded :)
>>>>>>>>>>>
>>>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
>>> either
>>>>>>> with
>>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
>>> without
>>>>>>> the
>>>>>>>>>>> tag, or just having a single aggregated metric without the tag
>>>> looks
>>>>>>>>>> good
>>>>>>>>>>> to me.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
>> j...@confluent.io
>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
>> provide
>>> a
>>>>>>>>>> list in
>>>>>>>>>>>> the KIP. I was also thinking we'd better start a documentation
>>>> page
>>>>>>>>>> with
>>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On
>>> the
>>>>>>>>>> metric
>>>>>>>>>>>> front, really I'm just pushing for us to maintain only one
>> level
>>>> of
>>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
>>>>>> have a
>>>>>>>>>>>> metric per reason and log the reason instead), that's fine by
>>> me.
>>>> I
>>>>>>>>>> just
>>>>>>>>>>>> don't think it provides a lot of extra value per complexity
>>>>>>> (interface
>>>>>>>>>>> and
>>>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
>>>>>> addition
>>>>>>>>>> to
>>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>>
>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>>> per-task,
>>>>>>>>>> or
>>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>>> * introduce a warn-level log detailing the
>>> topic/partition/offset
>>>>>> and
>>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>>
>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>>> wangg...@gmail.com
>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>>> coarser
>>>>>>>>>>>> metrics
>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>> systems I
>>>>>> have
>>>>>>>>>>> used
>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
>>> So
>>>>>> my
>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>> thinking
>>>>>>>>>> of an
>>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
>>> directly
>>>>>> as
>>>>>>>>>>> part
>>>>>>>>>>>> of
>>>>>>>>>>>>> the object name, and if users wants to monitor them they need
>>> to
>>>>>>>>>> know
>>>>>>>>>>>> these
>>>>>>>>>>>>> values before hand. That is also why I think we do want to
>> list
>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> possible values of the reason tags in the KIP, since
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
>> the
>>>>>>>>>> kind of
>>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
>> on
>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>>> once,
>>>>>>>>>> that
>>>>>>>>>>>> it
>>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>>> available
>>>>>>>>>>>>> information about the reason and location of skips all the
>>> time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is a good point. I think we can either expose all levels
>>>>>>>>>> metrics
>>>>>>>>>>> as
>>>>>>>>>>>> by
>>>>>>>>>>>>> default, or only expose the most lower-level metrics and get
>>> rid
>>>>>> of
>>>>>>>>>>> other
>>>>>>>>>>>>> levels to let users do roll-ups themselves (which will be a
>>> much
>>>>>>>>>> larger
>>>>>>>>>>>>> scope for discussion), or we can encourage users to not
>> purely
>>>>>>>>>> depend
>>>>>>>>>>> on
>>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users only
>>> be
>>>>>>>>>>> alerted
>>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j entry
>> each
>>>>>>>>>> time we
>>>>>>>>>>>> are
>>>>>>>>>>>>> about to skip a record all over the places, so that upon
>> being
>>>>>>>>>> notified
>>>>>>>>>>>>> users can look into the logs to find the details on where /
>>> when
>>>>>> it
>>>>>>>>>>>>> happens. WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>>> j...@confluent.io
>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
>>> values.
>>>>>> I
>>>>>>>>>>> can
>>>>>>>>>>>>> list
>>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking
>> this
>>>>>>>>>> level
>>>>>>>>>>> of
>>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
>> documentation
>>>>>>>>>> review
>>>>>>>>>>>>> would
>>>>>>>>>>>>>> be sufficient. If you all disagree and would like it
>> included
>>> in
>>>>>>>>>> the
>>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern
>> of
>>>>>>>>>> keeping
>>>>>>>>>>>> the
>>>>>>>>>>>>>> same name but eliminating the tags for the dimensions that
>>> were
>>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>>> becomes a
>>>>>>>>>>> factor
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> choosing whether to (by default) update one coarse metric at
>>> the
>>>>>>>>>>> thread
>>>>>>>>>>>>>> level or one granular metric at the processor-node level,
>>> since
>>>>>>>>>> it's
>>>>>>>>>>>> just
>>>>>>>>>>>>>> one metric being updated either way. I do agree that if we
>>> were
>>>>>> to
>>>>>>>>>>>> update
>>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
>>>>>>>>>> consider
>>>>>>>>>>> the
>>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics
>> for
>>>>>> all
>>>>>>>>>>>> nodes
>>>>>>>>>>>>>> regardless of whether they can or do skip records. Perhaps
>> we
>>>> can
>>>>>>>>>>>> lazily
>>>>>>>>>>>>>> register the metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of
>> the
>>>>>>>>>> kind of
>>>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
>>> on
>>>>>>>>>> all
>>>>>>>>>>> the
>>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>>>>> once,
>>>>>>>>>>> that
>>>>>>>>>>>>> it
>>>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>>> available
>>>>>>>>>>>>>> information about the reason and location of skips all the
>>> time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>>> coarser
>>>>>>>>>>>> metrics
>>>>>>>>>>>>>> being easier than finer ones. All the metric alerting
>> systems
>>> I
>>>>>>>>>> have
>>>>>>>>>>>> used
>>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
>> tags.
>>> So
>>>>>>>>>> my
>>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>>> thinking
>>>>>>>>>> of
>>>>>>>>>>> an
>>>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>>> wangg...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
>>> tag?
>>>>>>>>>> In
>>>>>>>>>>> the
>>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
>>>> you're
>>>>>>>>>>>> going
>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> categorize each of them as a separate reason, or is there
>> any
>>>>>>>>>>>>> additional
>>>>>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also I'm wondering if we should add another metric that do
>>> not
>>>>>>>>>> have
>>>>>>>>>>>> the
>>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons? This
>> is
>>>>>>>>>> for
>>>>>>>>>>>> users
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> easily set their alerting notifications (otherwise they
>> have
>>> to
>>>>>>>>>>> write
>>>>>>>>>>>>> on
>>>>>>>>>>>>>>> notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
>>>> "per-thread,
>>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>>> per-processor-node", and today we only set the per-thread
>>>>>>>>>> metrics
>>>>>>>>>>> as
>>>>>>>>>>>>> INFO
>>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
>>> your
>>>>>>>>>>>> argument
>>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics today,
>> but
>>>>>>>>>> I'm
>>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>>> level
>>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note the recoding implementation of these three levels are
>>>>>>>>>>> different
>>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
>>> generate
>>>>>>>>>> the
>>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we just
>>>>>>>>>> record
>>>>>>>>>>>> them
>>>>>>>>>>>>>>> separately, which means that, if we turn on multiple levels
>>> of
>>>>>>>>>>>> metrics,
>>>>>>>>>>>>>> we
>>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that
>>> is
>>>>>>>>>> not
>>>>>>>>>>>> the
>>>>>>>>>>>>>> best
>>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>>> reporting,
>>>>>>>>>> but
>>>>>>>>>>> by
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>>> implementation
>>>>>>>>>>>> could
>>>>>>>>>>>>> be
>>>>>>>>>>>>>>> more efficient than only collecting the metrics at the
>> lowest
>>>>>>>>>>> level,
>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> then do the roll-up calculations outside of the metrics
>>>> classes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>>> records,
>>>>>>>>>>> AFAIK
>>>>>>>>>>>> we
>>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>>> aggregation
>>>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks
>> like
>>> an
>>>>>>>>>>>>> overkill
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective the
>>>>>>>>>> "reason"
>>>>>>>>>>>> tag
>>>>>>>>>>>>>> may
>>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
>>> topology
>>>>>>>>>> is
>>>>>>>>>>>>>> causing
>>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
>>> "per-thread,
>>>>>>>>>>>>> per-task"
>>>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
>> shoot
>>> in
>>>>>>>>>>> DEBUG
>>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
>>> which
>>>>>>>>>> is
>>>>>>>>>>>>> turned
>>>>>>>>>>>>>>> on by default. So under normal execution users still only
>>> need
>>>>>>>>>> INFO
>>>>>>>>>>>>> level
>>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
>> skipped-records
>>>>>>>>>>> metrics
>>>>>>>>>>>> as
>>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on
>>>> DEBUG
>>>>>>>>>>>>> metrics
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>> look into which task is actually causing the skipped
>> records.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>>> matth...@confluent.io>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
>>> what
>>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
>>> should
>>>>>>>>>>> list
>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say
>> that
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
>>> atm
>>>>>>>>>>> that
>>>>>>>>>>>>> we
>>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
>>> remove
>>>>>>>>>>> the
>>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>>> top-level
>>>>>>>>>>>> view
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
>>>>>>>>>>> Offering
>>>>>>>>>>>> a
>>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
>>> might
>>>>>>>>>> be
>>>>>>>>>>> a
>>>>>>>>>>>>>>>> better solution. In the past we did not offer this due to
>>>>>>>>>>>> performance
>>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
>>>>>>>>>>>> disagree,
>>>>>>>>>>>>>> can
>>>>>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
>>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
>> top-level
>>>>>>>>>> view,
>>>>>>>>>>>> but
>>>>>>>>>>>>>>>> instead of going more fine grained, we should consider to
>>> add
>>>>>>>>>>> this
>>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
>>> DEBUG
>>>>>>>>>>> level
>>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
>>> old
>>>>>>>>>>>> metrics
>>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>>> j...@confluent.io
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
>>>>>>>>>> skipped
>>>>>>>>>>>>> records
>>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> -- Guozhang
>>
>

signature.asc
Description: OpenPGP digital signature

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Reply via email to