Sounds great! The cryptic topic names can be an issue -- however, people can `describe()` their topology to map a name to the corresponding sub-topology/tasks and narrow an error down to the corresponding operators. I think this should be "sufficient for now" for debugging.
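For example, a quick sketch (the builder variable and the generated name in the comment are illustrative):

  import org.apache.kafka.streams.StreamsBuilder;
  import org.apache.kafka.streams.Topology;

  final StreamsBuilder builder = new StreamsBuilder();
  // ... define the topology ...
  final Topology topology = builder.build();

  // Prints each sub-topology with its processor nodes and the internal
  // topics they read and write, e.g. a repartition topic like
  // "KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition" -- which lets
  // you map a cryptic topic name from a log line back to the operator
  // that created it.
  System.out.println(topology.describe());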
Renaming those topics seems to be out-of-scope for this KIP.

-Matthias

On 4/3/18 2:45 PM, Guozhang Wang wrote:
> Thanks John, your proposal looks fine to me.
>
> I'll go ahead and look into the PR for more details myself.
>
> Guozhang
>
> On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>
>> Hi John,
>>
>> Thanks for making the updates.
>>
>> I agree with the information you've included in the logs as described above, as log statements without enough context/information can be frustrating.
>>
>> -Bill
>>
>> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <j...@confluent.io> wrote:
>>
>>> Allrighty, how about this, then...
>>>
>>> I'll move the metric back to the StreamThread and maintain the existing tag (client-id=...(per-thread client-id)). It won't be present in the TopologyTestDriver's metrics.
>>>
>>> As a side note, I'm not sure that the location of the log messages has visibility into the name of the thread or the task, or the processor node, for that matter. But at the end of the day, I don't think it really matters.
>>>
>>> None of those identifiers are in the public interface or user-controlled. For them to be useful for debugging, users would have to gain a very deep understanding of how their DSL program gets executed. From my perspective, they are all included in metric tags only to prevent collisions between the same metrics in different (e.g.) threads.
>>>
>>> I think what's important is to provide the right information in the logs so that users will be able to debug their issues. This is why the logs in my PR include the topic/partition/offset of the offending data, as well as the stacktrace of the exception from the deserializer (or for timestamps, the extracted timestamp and the class name of their extractor). This information alone should let them pinpoint the offending data and fix it.
>>>
>>> (I am aware that that topic name might be a repartition topic, and therefore also esoteric from the user's perspective, but I think it's the best we can do right now. It might be nice to explicitly take on a debugging-ergonomics task in the future and give all processor nodes human-friendly names. Then, we could surface these names in any logs or exceptions. But I'm inclined to call this out-of-scope for now.)
>>>
>>> Thanks again,
>>> -John
>>>
>>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>>> 1. If we can indeed gather all the context information from the log4j entries, I'd suggest we change to thread-level (I'm not sure if that is doable, so it would help if John already has some WIP PR that can let us decide).
>>>>
>>>> 2. We can consider adding the API in TopologyTestDriver for general testing purposes; that being said, I think Matthias has a good point that this alone should not be a driving motivation for us to keep this metric as task-level if 1) is true.
>>>>
>>>> Guozhang
>>>>
>>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>
>>>>> Thanks Guozhang, that was my intent.
>>>>>
>>>>> @John: yes, we should not nail down the exact log message. It's just to point out the trade-off. If we can get the required information in the logs, we might not need task-level metrics.
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote:
>>>>>> I think Matthias' comment is that we can still record the metrics on the thread-level, while having the WARN log entry include sufficient context information so that users can still easily narrow down the investigation scope.
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:
>>>>>>
>>>>>>> I agree we should add as much information as is reasonable to the log. For example, see this WIP PR I started for this KIP:
>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>>>>>>>
>>>>>>> I'm not sure if we should nail down the log messages in the KIP or in the PR discussion. What say you?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -John
>>>>>>>
>>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am not sure about the right log level either. Your arguments are convincing -- thus, I am fine with keeping WARN level.
>>>>>>>>
>>>>>>>> The task vs. thread level argument is an interesting one. However, I am wondering if we should add this information into the corresponding WARN logs that we write anyway? For this case, we can also log the corresponding operator (and other information like topic name etc. if needed). WDYT about this?
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level since skipped records are not expected in normal execution (for all reasons that we are aware of), and hence when an error happens users should be alerted by metrics and look into the log files; so to me if it is really spamming the log files that is also a good alert for users. Besides, for deserialization errors we already log at WARN level for this reason.
>>>>>>>>>
>>>>>>>>> Regarding the metrics levels: I was pondering on that as well. What made me think about and agree on task-level rather than thread-level is that, for some reasons like window retention, skips may possibly be happening on a subset of input partitions, and since tasks are correlated with partitions, the task-level metrics can help users narrow down on the specific input data partitions.
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Matthias,
>>>>>>>>>>
>>>>>>>>>> No worries! Thanks for the reply.
>>>>>>>>>>
>>>>>>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to write a quick test exercising the current behavior and discovered that the metrics weren't available. It seemed like they should be, so I tacked it on to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>>>>>>
>>>>>>>>>> 2) I was also concerned about that, but I figured it would come up in discussion if I just went ahead and proposed it. And here we are!
>>>>>>>>>>
>>>>>>>>>> Here's my thought: maybe there are two classes of skips: "controlled" and "uncontrolled", where "controlled" means, as an app author, I deliberately filter out some events, and "uncontrolled" means that I simply don't account for some feature of the data, and the framework skips them (as opposed to crashing).
>>>>>>>>>>
>>>>>>>>>> In this breakdown, the skips I'm adding metrics for are all uncontrolled skips (and we hope to measure all the uncontrolled skips). Our skips are well documented, so it wouldn't be terrible to have an application in which you know you expect to have tons of uncontrolled skips, but it's not great either, since you may also have some *unexpected* uncontrolled skips. It'll be difficult to notice, since you're probably not alerting on the metric and filtering out the logs (whatever their level).
>>>>>>>>>>
>>>>>>>>>> I'd recommend any app author, as an alternative, to convert all expected skips to controlled ones, by updating the topology to filter those records out.
>>>>>>>>>>
>>>>>>>>>> Following from my recommendation, as a library author, I'm inclined to mark those logs WARN, since in my opinion they should be concerning to the app authors. I'd definitely want to show, rather than hide, them by default, so I would pick INFO at least.
>>>>>>>>>>
>>>>>>>>>> That said, logging is always a tricky issue for lower-level libraries that run inside user code, since we don't have all the information we need to make the right call.
>>>>>>>>>>
>>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as well. Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>>>>>
>>>>>>>>>>> I like your approach about "metrics are for monitoring, logs are for debugging" -- however:
>>>>>>>>>>>
>>>>>>>>>>> 1) I don't see a connection between this and the task-level metrics that you propose to get the metrics into `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver` and thus I'm wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor though (ie, fewer different metrics to monitor).
>>>>>>>>>>>
>>>>>>>>>>> 2) I am a little worried about WARN level logging and that it might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases for which skipped records are quite frequent and thus, WARN level logs might "flood" the logs.
>>>>>>>>>>>
>>>>>>>>>>> One final remark:
>>>>>>>>>>>
>>>>>>>>>>>> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.
>>>>>>>>>>>
>>>>>>>>>>> From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks all,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm on the fence with providing single-level metrics, but I think we'll have that discussion outside of this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again, and overall I'm +1 on this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one level of metrics", maybe we can discuss that separately from this KIP since that would be a larger scope of discussion. I agree that if we are going to maintain only one-level metrics it should be the lowest level and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) for different metrics people may be interested to investigate into different granularities, e.g. for poll / commit rate these are at the lowest task-level metrics, while for process-rate / skip-rate they can be as low as processor-node metrics, and 2) user-side rolling up may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that I can be persuaded :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now I'm thinking we can add the metric on thread-level, either with finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just having a single aggregated metric without the tag looks good to me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand. That is also why I think we do want to list all the possible values of the reason tags in the KIP, since
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That is a good point. I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do roll-ups themselves (which will be a much larger scope for discussion), or we can encourage users to not purely depend on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we can log an info / warn log4j entry each time we are about to skip a record, all over the places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear if you're going to categorize each of them as a separate reason, or if there are any additional ones you have in mind.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note the recording implementation of these three levels is different internally today: we did not just do the rolling up to generate the higher-level metrics from the lower-level ones; we record them separately, which means that, if we turn on multiple levels of metrics, we may duplicate collecting some metrics. One can argue that is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than only collecting the metrics at the lowest level and then doing the roll-up calculations outside of the metrics classes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for them to troubleshoot in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO level metrics for alerting (e.g. set alerts on all skipped-records metrics as non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those IMHO.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> About the fine-grained metrics vs. the roll-up: you say that
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions simultaneously
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at thread level.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable top-level view
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider to add this top-level view and add/keep the fine-grained metrics at DEBUG level only.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around skipped records in Streams.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>
>>>> --
>>>> -- Guozhang
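P.S.: To make the WARN-log proposal from the thread above concrete, the entry John describes would look roughly like this. This is an illustrative sketch only; the logger name, variable names, and message wording are my assumptions, not the literal text from the PR:

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  final Logger log = LoggerFactory.getLogger("skipped-records-example");

  // On a deserialization failure that the handler decided to "log and
  // continue" on. rawRecord and deserializationException are assumed to
  // be in scope at the skip site.
  log.warn(
      "Skipping record due to deserialization error. topic=[{}] partition=[{}] offset=[{}]",
      rawRecord.topic(), rawRecord.partition(), rawRecord.offset(),
      deserializationException);

Since the exception is passed as the last argument, SLF4J appends the stacktrace, so the log entry alone carries the topic/partition/offset and the reason needed to pinpoint the offending data.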
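On the monitoring side, a rough sketch of how an operator could read the proposed thread-level skipped-record metric from a running application. The name "skipped-records-total" is an assumption based on this discussion; the KIP page is authoritative:

  // 'streams' is a running org.apache.kafka.streams.KafkaStreams instance.
  streams.metrics().forEach((metricName, metric) -> {
      if ("skipped-records-total".equals(metricName.name())) {
          // One data point per stream thread, tagged with the per-thread client-id.
          System.out.println(metricName.tags() + " -> " + metric.metricValue());
      }
  });

Note that, as John pointed out above, with the metric recorded on the StreamThread, the same lookup against TopologyTestDriver#metrics() would not find it, since the test driver runs no stream threads.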