Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Guozhang Wang Tue, 03 Apr 2018 14:45:40 -0700

Thanks John, your proposal looks fine to me.

I'll go ahead and look into the PR for more details myself.



Guozhang

On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <[email protected]> wrote:

> Hi John,
>
> Thanks for making the updates.
>
> I agree with the information you've included in the logs as described
> above, as log statements without enough context/information can be
> frustrating.
>
> -Bill
>
> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <[email protected]> wrote:
>
> > Allrighty, how about this, then...
> >
> > I'll move the metric back to the StreamThread and maintain the existing
> tag
> > (client-id=...(per-thread client-id)). It won't be present in the
> > TopologyTestDriver's metrics.
> >
> > As a side note, I'm not sure that the location of the log messages has
> > visibility into the name of the thread or the task, or the processor
> node,
> > for that matter. But at the end of the day, I don't think it really
> > matters.
> >
> > None of those identifiers are in the public interface or user-controlled.
> > For them to be useful for debugging, users would have to gain a very deep
> > understanding of how their DSL program gets executed. From my
> perspective,
> > they are all included in metric tags only to prevent collisions between
> the
> > same metrics in different (e.g.) threads.
> >
> > I think what's important is to provide the right information in the logs
> > that users will be able to debug their issues. This is why the logs in my
> > pr include the topic/partition/offset of the offending data, as well as
> the
> > stacktrace of the exception from the deserializer (or for timestamps, the
> > extracted timestamp and the class name of their extractor). This
> > information alone should let them pinpoint the offending data and fix it.
> >
> > (I am aware that that topic name might be a repartition topic, and
> > therefore also esoteric from the user's perspective, but I think it's the
> > best we can do right now. It might be nice to explicitly take on a
> > debugging ergonomics task in the future and give all processor nodes
> > human-friendly names. Then, we could surface these names in any logs or
> > exceptions. But I'm inclined to call this out-of-scope for now.)
> >
> > Thanks again,
> > -John
> >
> > On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <[email protected]>
> wrote:
> >
> > > 1. If we can indeed gather all the context information from the log4j
> > > entries I'd suggest we change to thread-level (I'm not sure if that is
> > > doable, so if John have already some WIP PR that can help us decide).
> > >
> > > 2. We can consider adding the API in TopologyTestDriver for general
> > testing
> > > purposes; that being said, I think Matthias has a good point that this
> > > alone should not be a driving motivation for us to keep this metric as
> > > task-level if 1) is true.
> > >
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <
> [email protected]>
> > > wrote:
> > >
> > > > Thanks Guozhang, that was my intent.
> > > >
> > > > @John: yes, we should not nail down the exact log message. It's just
> to
> > > > point out the trade-off. If we can get the required information in
> the
> > > > logs, we might not need task level metrics.
> > > >
> > > >
> > > > -Matthias
> > > >
> > > > On 4/3/18 11:26 AM, Guozhang Wang wrote:
> > > > > I think Matthias' comment is that, we can still record the metrics
> on
> > > the
> > > > > thread-level, while having the WARN log entry to include sufficient
> > > > context
> > > > > information so that users can still easily narrow down the
> > > investigation
> > > > > scope.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <[email protected]>
> > > wrote:
> > > > >
> > > > >> I agree we should add as much information as is reasonable to the
> > log.
> > > > For
> > > > >> example, see this WIP PR I started for this KIP:
> > > > >>
> > > > >> https://github.com/apache/kafka/pull/4812/files#diff-
> > > > >> 88d129f048bc842c7db5b2566a45fce8R80
> > > > >>
> > > > >> and
> > > > >>
> > > > >> https://github.com/apache/kafka/pull/4812/files#diff-
> > > > >> 69e6789eb675ec978a1abd24fed96eb1R111
> > > > >>
> > > > >> I'm not sure if we should nail down the log messages in the KIP or
> > in
> > > > the
> > > > >> PR discussion. What say you?
> > > > >>
> > > > >> Thanks,
> > > > >> -John
> > > > >>
> > > > >> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <
> > > [email protected]
> > > > >
> > > > >> wrote:
> > > > >>
> > > > >>> Thanks for sharing your thoughts. As I mentioned originally, I am
> > not
> > > > >>> sure about the right log level either. Your arguments are
> > convincing
> > > --
> > > > >>> thus, I am fine with keeping WARN level.
> > > > >>>
> > > > >>> The task vs thread level argument is an interesting one.
> However, I
> > > am
> > > > >>> wondering if we should add this information into the
> corresponding
> > > WARN
> > > > >>> logs that we write anyway? For this case, we can also log the
> > > > >>> corresponding operator (and other information like topic name etc
> > if
> > > > >>> needed). WDYT about this?
> > > > >>>
> > > > >>>
> > > > >>> -Matthias
> > > > >>>
> > > > >>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
> > > > >>>> Regarding logging: I'm inclined to keep logging at WARN level
> > since
> > > > >>> skipped
> > > > >>>> records are not expected in normal execution (for all reasons
> that
> > > we
> > > > >> are
> > > > >>>> aware of), and hence when error happens users should be alerted
> > from
> > > > >>>> metrics and looked into the log files, so to me if it is really
> > > > >> spamming
> > > > >>>> the log files it is also a good alert for users. Besides for
> > > > >> deserialize
> > > > >>>> errors we already log at WARN level for this reason.
> > > > >>>>
> > > > >>>> Regarding the metrics-levels: I was pondering on that as well.
> > What
> > > > >> made
> > > > >>> me
> > > > >>>> to think and agree on task-level than thread-level is that for
> > some
> > > > >>> reasons
> > > > >>>> like window retention, they may possibly be happening on a
> subset
> > of
> > > > >>> input
> > > > >>>> partitions, and tasks are correlated with partitions the
> > task-level
> > > > >>> metrics
> > > > >>>> can help users to narrow down on the specific input data
> > partitions.
> > > > >>>>
> > > > >>>>
> > > > >>>> Guozhang
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <[email protected]
> >
> > > > >> wrote:
> > > > >>>>
> > > > >>>>> Hi Matthias,
> > > > >>>>>
> > > > >>>>> No worries! Thanks for the reply.
> > > > >>>>>
> > > > >>>>> 1) There isn't a connection. I tried using the
> TopologyTestDriver
> > > to
> > > > >>> write
> > > > >>>>> a quick test exercising the current behavior and discovered
> that
> > > the
> > > > >>>>> metrics weren't available. It seemed like they should be, so I
> > > tacked
> > > > >>> it on
> > > > >>>>> to this KIP. If you feel it's inappropriate, I can pull it back
> > > out.
> > > > >>>>>
> > > > >>>>> 2) I was also concerned about that, but I figured it would come
> > up
> > > in
> > > > >>>>> discussion if I just went ahead and proposed it. And here we
> are!
> > > > >>>>>
> > > > >>>>> Here's my thought: maybe there are two classes of skips:
> > > "controlled"
> > > > >>> and
> > > > >>>>> "uncontrolled", where "controlled" means, as an app author, I
> > > > >>> deliberately
> > > > >>>>> filter out some events, and "uncontrolled" means that I simply
> > > don't
> > > > >>>>> account for some feature of the data, and the framework skips
> > them
> > > > (as
> > > > >>>>> opposed to crashing).
> > > > >>>>>
> > > > >>>>> In this breakdowns, the skips I'm adding metrics for are all
> > > > >>> uncontrolled
> > > > >>>>> skips (and we hope to measure all the uncontrolled skips). Our
> > > skips
> > > > >> are
> > > > >>>>> well documented, so it wouldn't be terrible to have an
> > application
> > > in
> > > > >>> which
> > > > >>>>> you know you expect to have tons of uncontrolled skips, but
> it's
> > > not
> > > > >>> great
> > > > >>>>> either, since you may also have some *unexpected* uncontrolled
> > > skips.
> > > > >>> It'll
> > > > >>>>> be difficult to notice, since you're probably not alerting on
> the
> > > > >> metric
> > > > >>>>> and filtering out the logs (whatever their level).
> > > > >>>>>
> > > > >>>>> I'd recommend any app author, as an alternative, to convert all
> > > > >> expected
> > > > >>>>> skips to controlled ones, by updating the topology to filter
> > those
> > > > >>> records
> > > > >>>>> out.
> > > > >>>>>
> > > > >>>>> Following from my recommendation, as a library author, I'm
> > inclined
> > > > to
> > > > >>> mark
> > > > >>>>> those logs WARN, since in my opinion, they should be concerning
> > to
> > > > the
> > > > >>> app
> > > > >>>>> authors. I'd definitely want to show, rather than hide, them by
> > > > >>> default, so
> > > > >>>>> I would pick INFO at least.
> > > > >>>>>
> > > > >>>>> That said, logging is always a tricky issue for lower-level
> > > libraries
> > > > >>> that
> > > > >>>>> run inside user code, since we don't have all the information
> we
> > > need
> > > > >> to
> > > > >>>>> make the right call.
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On your last note, yeah, I got that impression from Guozhang as
> > > well.
> > > > >>>>> Thanks for the clarification.
> > > > >>>>>
> > > > >>>>> -John
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
> > > > >> [email protected]>
> > > > >>>>> wrote:
> > > > >>>>>
> > > > >>>>>> John,
> > > > >>>>>>
> > > > >>>>>> sorry for my late reply and thanks for updating the KIP.
> > > > >>>>>>
> > > > >>>>>> I like your approach about "metrics are for monitoring, logs
> are
> > > for
> > > > >>>>>> debugging" -- however:
> > > > >>>>>>
> > > > >>>>>> 1) I don't see a connection between this and the task-level
> > > metrics
> > > > >>> that
> > > > >>>>>> you propose to get the metrics in `TopologyTestDriver`. I
> don't
> > > > think
> > > > >>>>>> people would monitor the `TopologyTestDriver` an thus
> wondering
> > > why
> > > > >> it
> > > > >>>>>> is important to include the metrics there? Thread-level metric
> > > might
> > > > >> be
> > > > >>>>>> easier to monitor though (ie, less different metric to
> monitor).
> > > > >>>>>>
> > > > >>>>>> 2) I am a little worried about WARN level logging and that it
> > > might
> > > > >> be
> > > > >>>>>> too chatty -- as you pointed out, it's about debugging, thus
> > DEBUG
> > > > >>> level
> > > > >>>>>> might be better. Not 100% sure about this to be honest. What
> is
> > > the
> > > > >>>>>> general assumption about the frequency for skipped records? I
> > > could
> > > > >>>>>> imagine cases for which skipped records are quite frequent and
> > > thus,
> > > > >>>>>> WARN level logs might "flood" the logs
> > > > >>>>>>
> > > > >>>>>> One final remark:
> > > > >>>>>>
> > > > >>>>>>> More
> > > > >>>>>>> generally, I would like to establish a pattern in which we
> > could
> > > > add
> > > > >>>>> new
> > > > >>>>>>> values for the "reason" tags without needing a KIP to do so.
> > > > >>>>>>
> > > > >>>>>> From my understanding, this is not feasible. Changing metrics
> is
> > > > >> always
> > > > >>>>>> considered a public API change, and we need a KIP for any
> > change.
> > > As
> > > > >> we
> > > > >>>>>> moved away from tagging, it doesn't matter for the KIP anymore
> > --
> > > > >> just
> > > > >>>>>> wanted to point it out.
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> -Matthias
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
> > > > >>>>>>> Allrighty! The KIP is updated.
> > > > >>>>>>>
> > > > >>>>>>> Thanks again, all, for the feedback.
> > > > >>>>>>> -John
> > > > >>>>>>>
> > > > >>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <
> > [email protected]
> > > >
> > > > >>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hey Guozhang and Bill,
> > > > >>>>>>>>
> > > > >>>>>>>> Ok, I'll update the KIP. At the risk of disturbing
> consensus,
> > > I'd
> > > > >>> like
> > > > >>>>>> to
> > > > >>>>>>>> put it in the task instead of the thread so that it'll show
> up
> > > in
> > > > >> the
> > > > >>>>>>>> TopologyTestDriver metrics as well.
> > > > >>>>>>>>
> > > > >>>>>>>> I'm leaning toward keeping the scope where it is right now,
> > but
> > > if
> > > > >>>>>> others
> > > > >>>>>>>> want to advocate for tossing in some more metrics, we can go
> > > that
> > > > >>>>> route.
> > > > >>>>>>>>
> > > > >>>>>>>> Thanks all,
> > > > >>>>>>>> -John
> > > > >>>>>>>>
> > > > >>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <
> > [email protected]
> > > >
> > > > >>>>> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
> > > > >>>>>>>>>
> > > > >>>>>>>>> I'm on the fence with providing a single level metrics,
> but I
> > > > >> think
> > > > >>>>>> we'll
> > > > >>>>>>>>> have that discussion outside of this KIP.
> > > > >>>>>>>>>
> > > > >>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
> > > > >>> per-task,
> > > > >>>>>> or
> > > > >>>>>>>>>> per-processor-node) with no "reason"
> > > > >>>>>>>>>> * introduce a warn-level log detailing the
> > > > topic/partition/offset
> > > > >>>>> and
> > > > >>>>>>>>>> reason of the skipped record
> > > > >>>>>>>>>
> > > > >>>>>>>>> I'm +1 on both of these suggestions.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Finally, we have had requests in the past for some metrics
> > > around
> > > > >>>>> when
> > > > >>>>>>>>> persistent store removes an expired window.  Would adding
> > that
> > > to
> > > > >>> our
> > > > >>>>>>>>> metrics stretch the scope of this KIP too much?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thanks again and overall I'm +1 on this KIP
> > > > >>>>>>>>>
> > > > >>>>>>>>> Bill
> > > > >>>>>>>>>
> > > > >>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
> > > > >> [email protected]>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> The proposal sounds good to me. About "maintain only one
> > level
> > > > of
> > > > >>>>>>>>> metrics"
> > > > >>>>>>>>>> maybe we can discuss about that separately from this KIP
> > since
> > > > >> that
> > > > >>>>>>>>> would
> > > > >>>>>>>>>> be a larger scope of discussion. I agree that if we are
> > going
> > > to
> > > > >>>>>>>>> maintain
> > > > >>>>>>>>>> only one-level metrics it should be lowest level and we
> > would
> > > > let
> > > > >>>>>> users
> > > > >>>>>>>>> to
> > > > >>>>>>>>>> do the roll-ups themselves, but I'm still not fully
> > convinced
> > > > >> that
> > > > >>>>> we
> > > > >>>>>>>>>> should just provide single-level metrics, because 1) I
> think
> > > for
> > > > >>>>>>>>> different
> > > > >>>>>>>>>> metrics people may be interested to investigate into
> > different
> > > > >>>>>>>>>> granularities, e.g. for poll / commit rate these are at
> the
> > > > >> lowest
> > > > >>>>>>>>>> task-level metrics, while for process-rate / skip-rate
> they
> > > can
> > > > >> be
> > > > >>>>> as
> > > > >>>>>>>>> low
> > > > >>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups
> may
> > > not
> > > > >> be
> > > > >>>>>> very
> > > > >>>>>>>>>> straight-forward. But for 2) if someone can provide an
> > > efficient
> > > > >>> and
> > > > >>>>>>>>> easy
> > > > >>>>>>>>>> implementation of that I can be persuaded :)
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> For now I'm thinking we can add the metric on
> thread-level,
> > > > >> either
> > > > >>>>>> with
> > > > >>>>>>>>>> finer grained ones with "reason" tag plus an aggregated
> one
> > > > >> without
> > > > >>>>>> the
> > > > >>>>>>>>>> tag, or just having a single aggregated metric without the
> > tag
> > > > >>> looks
> > > > >>>>>>>>> good
> > > > >>>>>>>>>> to me.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Guozhang
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <
> > > > [email protected]
> > > > >>>
> > > > >>>>>>>>> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Hey Guozhang,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll
> > > provide
> > > > >> a
> > > > >>>>>>>>> list in
> > > > >>>>>>>>>>> the KIP. I was also thinking we'd better start a
> > > documentation
> > > > >>> page
> > > > >>>>>>>>> with
> > > > >>>>>>>>>>> the metrics listed.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> I'd have no problem logging a warning when we skip
> records.
> > > On
> > > > >> the
> > > > >>>>>>>>> metric
> > > > >>>>>>>>>>> front, really I'm just pushing for us to maintain only
> one
> > > > level
> > > > >>> of
> > > > >>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we
> > > don't
> > > > >>>>> have a
> > > > >>>>>>>>>>> metric per reason and log the reason instead), that's
> fine
> > by
> > > > >> me.
> > > > >>> I
> > > > >>>>>>>>> just
> > > > >>>>>>>>>>> don't think it provides a lot of extra value per
> complexity
> > > > >>>>>> (interface
> > > > >>>>>>>>>> and
> > > > >>>>>>>>>>> implementation) to maintain roll-ups at the thread level
> in
> > > > >>>>> addition
> > > > >>>>>>>>> to
> > > > >>>>>>>>>>> lower-level metrics.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> How about this instead:
> > > > >>>>>>>>>>> * maintain one skipped-record metric (could be
> per-thread,
> > > > >>>>> per-task,
> > > > >>>>>>>>> or
> > > > >>>>>>>>>>> per-processor-node) with no "reason"
> > > > >>>>>>>>>>> * introduce a warn-level log detailing the
> > > > >> topic/partition/offset
> > > > >>>>> and
> > > > >>>>>>>>>>> reason of the skipped record
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> If you like that, I can update the KIP.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>> -John
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
> > > > >>> [email protected]
> > > > >>>>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts
> on
> > > > >> coarser
> > > > >>>>>>>>>>> metrics
> > > > >>>>>>>>>>>> being easier than finer ones. All the metric alerting
> > > systems
> > > > I
> > > > >>>>> have
> > > > >>>>>>>>>> used
> > > > >>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> > > tags.
> > > > >> So
> > > > >>>>> my
> > > > >>>>>>>>>>>> experience doesn't say that this is a use case. Were you
> > > > >> thinking
> > > > >>>>>>>>> of an
> > > > >>>>>>>>>>>> alerting system that makes such a pre-aggregation
> > valuable?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
> > > > >> directly
> > > > >>>>> as
> > > > >>>>>>>>>> part
> > > > >>>>>>>>>>> of
> > > > >>>>>>>>>>>> the object name, and if users wants to monitor them they
> > > need
> > > > >> to
> > > > >>>>>>>>> know
> > > > >>>>>>>>>>> these
> > > > >>>>>>>>>>>> values before hand. That is also why I think we do want
> to
> > > > list
> > > > >>>>> all
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>> possible values of the reason tags in the KIP, since
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example
> of
> > > the
> > > > >>>>>>>>> kind of
> > > > >>>>>>>>>>>> scenario that would lead me as an operator to run with
> > DEBUG
> > > > on
> > > > >>>>> all
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> > record
> > > > >>> once,
> > > > >>>>>>>>> that
> > > > >>>>>>>>>>> it
> > > > >>>>>>>>>>>> would ever happen again. The solution is to capture all
> > the
> > > > >>>>>>>>> available
> > > > >>>>>>>>>>>> information about the reason and location of skips all
> the
> > > > >> time.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> That is a good point. I think we can either expose all
> > > levels
> > > > >>>>>>>>> metrics
> > > > >>>>>>>>>> as
> > > > >>>>>>>>>>> by
> > > > >>>>>>>>>>>> default, or only expose the most lower-level metrics and
> > get
> > > > >> rid
> > > > >>>>> of
> > > > >>>>>>>>>> other
> > > > >>>>>>>>>>>> levels to let users do roll-ups themselves (which will
> be
> > a
> > > > >> much
> > > > >>>>>>>>> larger
> > > > >>>>>>>>>>>> scope for discussion), or we can encourage users to not
> > > purely
> > > > >>>>>>>>> depend
> > > > >>>>>>>>>> on
> > > > >>>>>>>>>>>> metrics for such trouble shooting: that is to say, users
> > > only
> > > > >> be
> > > > >>>>>>>>>> alerted
> > > > >>>>>>>>>>>> based on metrics, and we can log a info / warn log4j
> entry
> > > > each
> > > > >>>>>>>>> time we
> > > > >>>>>>>>>>> are
> > > > >>>>>>>>>>>> about to skip a record all over the places, so that upon
> > > being
> > > > >>>>>>>>> notified
> > > > >>>>>>>>>>>> users can look into the logs to find the details on
> where
> > /
> > > > >> when
> > > > >>>>> it
> > > > >>>>>>>>>>>> happens. WDYT?
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Guozhang
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
> > > > >> [email protected]
> > > > >>>>
> > > > >>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Hey Guozhang,
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks for the review.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> 1.
> > > > >>>>>>>>>>>>> Matthias raised the same question about the "reason"
> tag
> > > > >> values.
> > > > >>>>> I
> > > > >>>>>>>>>> can
> > > > >>>>>>>>>>>> list
> > > > >>>>>>>>>>>>> all possible values of the "reason" tag, but I'm
> thinking
> > > > this
> > > > >>>>>>>>> level
> > > > >>>>>>>>>> of
> > > > >>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and
> > > > documentation
> > > > >>>>>>>>> review
> > > > >>>>>>>>>>>> would
> > > > >>>>>>>>>>>>> be sufficient. If you all disagree and would like it
> > > included
> > > > >> in
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>> KIP, I
> > > > >>>>>>>>>>>>> can certainly do that.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the
> > pattern
> > > of
> > > > >>>>>>>>> keeping
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>> same name but eliminating the tags for the dimensions
> > that
> > > > >> were
> > > > >>>>>>>>>>>> rolled-up.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> 2.
> > > > >>>>>>>>>>>>> I'm not too sure that implementation efficiency really
> > > > >> becomes a
> > > > >>>>>>>>>> factor
> > > > >>>>>>>>>>>> in
> > > > >>>>>>>>>>>>> choosing whether to (by default) update one coarse
> metric
> > > at
> > > > >> the
> > > > >>>>>>>>>> thread
> > > > >>>>>>>>>>>>> level or one granular metric at the processor-node
> level,
> > > > >> since
> > > > >>>>>>>>> it's
> > > > >>>>>>>>>>> just
> > > > >>>>>>>>>>>>> one metric being updated either way. I do agree that if
> > we
> > > > >> were
> > > > >>>>> to
> > > > >>>>>>>>>>> update
> > > > >>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we
> > should
> > > > >>>>>>>>> consider
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> efficiency.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> I agree it's probably not necessary to surface the
> > metrics
> > > > for
> > > > >>>>> all
> > > > >>>>>>>>>>> nodes
> > > > >>>>>>>>>>>>> regardless of whether they can or do skip records.
> > Perhaps
> > > we
> > > > >>> can
> > > > >>>>>>>>>>> lazily
> > > > >>>>>>>>>>>>> register the metrics.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> In my email in response to Matthias, I gave an example
> of
> > > the
> > > > >>>>>>>>> kind of
> > > > >>>>>>>>>>>>> scenario that would lead me as an operator to run with
> > > DEBUG
> > > > >> on
> > > > >>>>>>>>> all
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped
> > > record
> > > > >>>>> once,
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>> it
> > > > >>>>>>>>>>>>> would ever happen again. The solution is to capture all
> > the
> > > > >>>>>>>>> available
> > > > >>>>>>>>>>>>> information about the reason and location of skips all
> > the
> > > > >> time.
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> One thing you mention is the notion of setting alerts
> on
> > > > >> coarser
> > > > >>>>>>>>>>> metrics
> > > > >>>>>>>>>>>>> being easier than finer ones. All the metric alerting
> > > systems
> > > > >> I
> > > > >>>>>>>>> have
> > > > >>>>>>>>>>> used
> > > > >>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over
> > > tags.
> > > > >> So
> > > > >>>>>>>>> my
> > > > >>>>>>>>>>>>> experience doesn't say that this is a use case. Were
> you
> > > > >>> thinking
> > > > >>>>>>>>> of
> > > > >>>>>>>>>> an
> > > > >>>>>>>>>>>>> alerting system that makes such a pre-aggregation
> > valuable?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Thanks again,
> > > > >>>>>>>>>>>>> -John
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
> > > > >>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hello John,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Thanks for the KIP. Some comments:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> 1. Could you list all the possible values of the
> > "reason"
> > > > >> tag?
> > > > >>>>>>>>> In
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>> JIRA
> > > > >>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear
> > if
> > > > >>> you're
> > > > >>>>>>>>>>> going
> > > > >>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> categorize each of them as a separate reason, or is
> > there
> > > > any
> > > > >>>>>>>>>>>> additional
> > > > >>>>>>>>>>>>>> ones you have in mind.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Also I'm wondering if we should add another metric
> that
> > do
> > > > >> not
> > > > >>>>>>>>> have
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>>> reason tag but aggregates among all possible reasons?
> > This
> > > > is
> > > > >>>>>>>>> for
> > > > >>>>>>>>>>> users
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> easily set their alerting notifications (otherwise
> they
> > > have
> > > > >> to
> > > > >>>>>>>>>> write
> > > > >>>>>>>>>>>> on
> > > > >>>>>>>>>>>>>> notification rule per reason) in their monitoring
> > systems.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
> > > > >>> "per-thread,
> > > > >>>>>>>>>>>>> per-task,
> > > > >>>>>>>>>>>>>> per-processor-node", and today we only set the
> > per-thread
> > > > >>>>>>>>> metrics
> > > > >>>>>>>>>> as
> > > > >>>>>>>>>>>> INFO
> > > > >>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree
> > with
> > > > >> your
> > > > >>>>>>>>>>> argument
> > > > >>>>>>>>>>>>>> that we are missing the per-client roll-up metrics
> > today,
> > > > but
> > > > >>>>>>>>> I'm
> > > > >>>>>>>>>>>>> convinced
> > > > >>>>>>>>>>>>>> that the right way to approach it would be
> > > > >>>>>>>>>>> "just-providing-the-lowest-
> > > > >>>>>>>>>>>>>> level
> > > > >>>>>>>>>>>>>> metrics only".
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Note the recoding implementation of these three levels
> > are
> > > > >>>>>>>>>> different
> > > > >>>>>>>>>>>>>> internally today: we did not just do the rolling up to
> > > > >> generate
> > > > >>>>>>>>> the
> > > > >>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we
> > > just
> > > > >>>>>>>>> record
> > > > >>>>>>>>>>> them
> > > > >>>>>>>>>>>>>> separately, which means that, if we turn on multiple
> > > levels
> > > > >> of
> > > > >>>>>>>>>>> metrics,
> > > > >>>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue
> > > that
> > > > >> is
> > > > >>>>>>>>> not
> > > > >>>>>>>>>>> the
> > > > >>>>>>>>>>>>> best
> > > > >>>>>>>>>>>>>> way to represent multi-level metrics collecting and
> > > > >> reporting,
> > > > >>>>>>>>> but
> > > > >>>>>>>>>> by
> > > > >>>>>>>>>>>>> only
> > > > >>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
> > > > >>> implementation
> > > > >>>>>>>>>>> could
> > > > >>>>>>>>>>>> be
> > > > >>>>>>>>>>>>>> more efficient than only collecting the metrics at the
> > > > lowest
> > > > >>>>>>>>>> level,
> > > > >>>>>>>>>>>> and
> > > > >>>>>>>>>>>>>> then do the roll-up calculations outside of the
> metrics
> > > > >>> classes.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
> > > > >> records,
> > > > >>>>>>>>>> AFAIK
> > > > >>>>>>>>>>> we
> > > > >>>>>>>>>>>>>> will only skip records at the source, sink, window and
> > > > >>>>>>>>> aggregation
> > > > >>>>>>>>>>>>>> processor nodes, so adding a metric per processor
> looks
> > > like
> > > > >> an
> > > > >>>>>>>>>>>> overkill
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> me as well. On the other hand, from user's perspective
> > the
> > > > >>>>>>>>> "reason"
> > > > >>>>>>>>>>> tag
> > > > >>>>>>>>>>>>> may
> > > > >>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
> > > > >> topology
> > > > >>>>>>>>> is
> > > > >>>>>>>>>>>>> causing
> > > > >>>>>>>>>>>>>> records to be dropped on the floor. So I think the
> > > > >> "per-thread,
> > > > >>>>>>>>>>>> per-task"
> > > > >>>>>>>>>>>>>> level metrics should be sufficient for them in trouble
> > > shoot
> > > > >> in
> > > > >>>>>>>>>> DEBUG
> > > > >>>>>>>>>>>>> mode,
> > > > >>>>>>>>>>>>>> and we can add another "per-thread" level metrics as
> > INFO
> > > > >> which
> > > > >>>>>>>>> is
> > > > >>>>>>>>>>>> turned
> > > > >>>>>>>>>>>>>> on by default. So under normal execution users still
> > only
> > > > >> need
> > > > >>>>>>>>> INFO
> > > > >>>>>>>>>>>> level
> > > > >>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all
> > > skipped-records
> > > > >>>>>>>>>> metrics
> > > > >>>>>>>>>>> as
> > > > >>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can
> turn
> > on
> > > > >>> DEBUG
> > > > >>>>>>>>>>>> metrics
> > > > >>>>>>>>>>>>> to
> > > > >>>>>>>>>>>>>> look into which task is actually causing the skipped
> > > > records.
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Guozhang
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
> > > > >>>>>>>>>>>> [email protected]>
> > > > >>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Thanks for the KIP John.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Reading the material on the related Jiras, I am
> > wondering
> > > > >> what
> > > > >>>>>>>>>>>> `reason`
> > > > >>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The
> KIP
> > > > >> should
> > > > >>>>>>>>>> list
> > > > >>>>>>>>>>>>> those
> > > > >>>>>>>>>>>>>>> IMHO.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you
> say
> > > that
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
> > > > >>>>>>>>>> simultaneously
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not
> > > convinced
> > > > >> atm
> > > > >>>>>>>>>> that
> > > > >>>>>>>>>>>> we
> > > > >>>>>>>>>>>>>>> should put the fine grained metrics into INFO level
> and
> > > > >> remove
> > > > >>>>>>>>>> the
> > > > >>>>>>>>>>>>>>> roll-up at thread level.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
> > > > >>>>>>>>> top-level
> > > > >>>>>>>>>>> view
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> This is a fair concern, but I don't share the
> > conclusion.
> > > > >>>>>>>>>> Offering
> > > > >>>>>>>>>>> a
> > > > >>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the
> box
> > > > >> might
> > > > >>>>>>>>> be
> > > > >>>>>>>>>> a
> > > > >>>>>>>>>>>>>>> better solution. In the past we did not offer this
> due
> > to
> > > > >>>>>>>>>>> performance
> > > > >>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism.
> If
> > > you
> > > > >>>>>>>>>>> disagree,
> > > > >>>>>>>>>>>>> can
> > > > >>>>>>>>>>>>>>> you provide some reasoning and add them to the
> > "Rejected
> > > > >>>>>>>>>>>> alternatives"
> > > > >>>>>>>>>>>>>>> section.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> To rephrase: I understand the issue about missing
> > > top-level
> > > > >>>>>>>>> view,
> > > > >>>>>>>>>>> but
> > > > >>>>>>>>>>>>>>> instead of going more fine grained, we should
> consider
> > to
> > > > >> add
> > > > >>>>>>>>>> this
> > > > >>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics
> at
> > > > >> DEBUG
> > > > >>>>>>>>>> level
> > > > >>>>>>>>>>>>> only
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to
> > remove
> > > > >> old
> > > > >>>>>>>>>>> metrics
> > > > >>>>>>>>>>>>>>> directly as you suggested.
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> -Matthias
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
> > > > >>>>>>>>>>>>>>>> Looks good to me.
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
> > > > >>>>>>>>>> [email protected]
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Hello all,
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics
> around
> > > > >>>>>>>>> skipped
> > > > >>>>>>>>>>>> records
> > > > >>>>>>>>>>>>>> in
> > > > >>>>>>>>>>>>>>>>> Streams.
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Please find the details here:
> > > > >>>>>>>>>>>>>>>>> https://cwiki.apache.org/
> > confluence/display/KAFKA/KIP-
> > > > >>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Please let me know what you think!
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>> Thanks,
> > > > >>>>>>>>>>>>>>>>> -John
> > > > >>>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> --
> > > > >>>>>>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> --
> > > > >>>>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> --
> > > > >>>>>>>>>> -- Guozhang
> > > > >>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-274: Kafka Streams Skipped Records Metrics

Reply via email to