Thanks Guozhang, that was my intent.

@John: yes, we should not nail down the exact log message. It's just to
point out the trade-off. If we can get the required information in the
logs, we might not need task level metrics.


-Matthias

On 4/3/18 11:26 AM, Guozhang Wang wrote:
> I think Matthias' comment is that, we can still record the metrics on the
> thread-level, while having the WARN log entry to include sufficient context
> information so that users can still easily narrow down the investigation
> scope.
> 
> 
> Guozhang
> 
> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:
> 
>> I agree we should add as much information as is reasonable to the log. For
>> example, see this WIP PR I started for this KIP:
>>
>> https://github.com/apache/kafka/pull/4812/files#diff-
>> 88d129f048bc842c7db5b2566a45fce8R80
>>
>> and
>>
>> https://github.com/apache/kafka/pull/4812/files#diff-
>> 69e6789eb675ec978a1abd24fed96eb1R111
>>
>> I'm not sure if we should nail down the log messages in the KIP or in the
>> PR discussion. What say you?
>>
>> Thanks,
>> -John
>>
>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io>
>> wrote:
>>
>>> Thanks for sharing your thoughts. As I mentioned originally, I am not
>>> sure about the right log level either. Your arguments are convincing --
>>> thus, I am fine with keeping WARN level.
>>>
>>> The task vs thread level argument is an interesting one. However, I am
>>> wondering if we should add this information into the corresponding WARN
>>> logs that we write anyway? For this case, we can also log the
>>> corresponding operator (and other information like topic name etc if
>>> needed). WDYT about this?
>>>
>>>
>>> -Matthias
>>>
>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>> Regarding logging: I'm inclined to keep logging at WARN level since
>>> skipped
>>>> records are not expected in normal execution (for all reasons that we
>> are
>>>> aware of), and hence when error happens users should be alerted from
>>>> metrics and looked into the log files, so to me if it is really
>> spamming
>>>> the log files it is also a good alert for users. Besides for
>> deserialize
>>>> errors we already log at WARN level for this reason.
>>>>
>>>> Regarding the metrics-levels: I was pondering on that as well. What
>> made
>>> me
>>>> to think and agree on task-level than thread-level is that for some
>>> reasons
>>>> like window retention, they may possibly be happening on a subset of
>>> input
>>>> partitions, and tasks are correlated with partitions the task-level
>>> metrics
>>>> can help users to narrow down on the specific input data partitions.
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io>
>> wrote:
>>>>
>>>>> Hi Matthias,
>>>>>
>>>>> No worries! Thanks for the reply.
>>>>>
>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to
>>> write
>>>>> a quick test exercising the current behavior and discovered that the
>>>>> metrics weren't available. It seemed like they should be, so I tacked
>>> it on
>>>>> to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>
>>>>> 2) I was also concerned about that, but I figured it would come up in
>>>>> discussion if I just went ahead and proposed it. And here we are!
>>>>>
>>>>> Here's my thought: maybe there are two classes of skips: "controlled"
>>> and
>>>>> "uncontrolled", where "controlled" means, as an app author, I
>>> deliberately
>>>>> filter out some events, and "uncontrolled" means that I simply don't
>>>>> account for some feature of the data, and the framework skips them (as
>>>>> opposed to crashing).
>>>>>
>>>>> In this breakdowns, the skips I'm adding metrics for are all
>>> uncontrolled
>>>>> skips (and we hope to measure all the uncontrolled skips). Our skips
>> are
>>>>> well documented, so it wouldn't be terrible to have an application in
>>> which
>>>>> you know you expect to have tons of uncontrolled skips, but it's not
>>> great
>>>>> either, since you may also have some *unexpected* uncontrolled skips.
>>> It'll
>>>>> be difficult to notice, since you're probably not alerting on the
>> metric
>>>>> and filtering out the logs (whatever their level).
>>>>>
>>>>> I'd recommend any app author, as an alternative, to convert all
>> expected
>>>>> skips to controlled ones, by updating the topology to filter those
>>> records
>>>>> out.
>>>>>
>>>>> Following from my recommendation, as a library author, I'm inclined to
>>> mark
>>>>> those logs WARN, since in my opinion, they should be concerning to the
>>> app
>>>>> authors. I'd definitely want to show, rather than hide, them by
>>> default, so
>>>>> I would pick INFO at least.
>>>>>
>>>>> That said, logging is always a tricky issue for lower-level libraries
>>> that
>>>>> run inside user code, since we don't have all the information we need
>> to
>>>>> make the right call.
>>>>>
>>>>>
>>>>>
>>>>> On your last note, yeah, I got that impression from Guozhang as well.
>>>>> Thanks for the clarification.
>>>>>
>>>>> -John
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <
>> matth...@confluent.io>
>>>>> wrote:
>>>>>
>>>>>> John,
>>>>>>
>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>
>>>>>> I like your approach about "metrics are for monitoring, logs are for
>>>>>> debugging" -- however:
>>>>>>
>>>>>> 1) I don't see a connection between this and the task-level metrics
>>> that
>>>>>> you propose to get the metrics in `TopologyTestDriver`. I don't think
>>>>>> people would monitor the `TopologyTestDriver` an thus wondering why
>> it
>>>>>> is important to include the metrics there? Thread-level metric might
>> be
>>>>>> easier to monitor though (ie, less different metric to monitor).
>>>>>>
>>>>>> 2) I am a little worried about WARN level logging and that it might
>> be
>>>>>> too chatty -- as you pointed out, it's about debugging, thus DEBUG
>>> level
>>>>>> might be better. Not 100% sure about this to be honest. What is the
>>>>>> general assumption about the frequency for skipped records? I could
>>>>>> imagine cases for which skipped records are quite frequent and thus,
>>>>>> WARN level logs might "flood" the logs
>>>>>>
>>>>>> One final remark:
>>>>>>
>>>>>>> More
>>>>>>> generally, I would like to establish a pattern in which we could add
>>>>> new
>>>>>>> values for the "reason" tags without needing a KIP to do so.
>>>>>>
>>>>>> From my understanding, this is not feasible. Changing metrics is
>> always
>>>>>> considered a public API change, and we need a KIP for any change. As
>> we
>>>>>> moved away from tagging, it doesn't matter for the KIP anymore --
>> just
>>>>>> wanted to point it out.
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>> Allrighty! The KIP is updated.
>>>>>>>
>>>>>>> Thanks again, all, for the feedback.
>>>>>>> -John
>>>>>>>
>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io>
>>>>> wrote:
>>>>>>>
>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>
>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd
>>> like
>>>>>> to
>>>>>>>> put it in the task instead of the thread so that it'll show up in
>> the
>>>>>>>> TopologyTestDriver metrics as well.
>>>>>>>>
>>>>>>>> I'm leaning toward keeping the scope where it is right now, but if
>>>>>> others
>>>>>>>> want to advocate for tossing in some more metrics, we can go that
>>>>> route.
>>>>>>>>
>>>>>>>> Thanks all,
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com>
>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>
>>>>>>>>> I'm on the fence with providing a single level metrics, but I
>> think
>>>>>> we'll
>>>>>>>>> have that discussion outside of this KIP.
>>>>>>>>>
>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>> per-task,
>>>>>> or
>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset
>>>>> and
>>>>>>>>>> reason of the skipped record
>>>>>>>>>
>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>
>>>>>>>>> Finally, we have had requests in the past for some metrics around
>>>>> when
>>>>>>>>> persistent store removes an expired window.  Would adding that to
>>> our
>>>>>>>>> metrics stretch the scope of this KIP too much?
>>>>>>>>>
>>>>>>>>> Thanks again and overall I'm +1 on this KIP
>>>>>>>>>
>>>>>>>>> Bill
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <
>> wangg...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The proposal sounds good to me. About "maintain only one level of
>>>>>>>>> metrics"
>>>>>>>>>> maybe we can discuss about that separately from this KIP since
>> that
>>>>>>>>> would
>>>>>>>>>> be a larger scope of discussion. I agree that if we are going to
>>>>>>>>> maintain
>>>>>>>>>> only one-level metrics it should be lowest level and we would let
>>>>>> users
>>>>>>>>> to
>>>>>>>>>> do the roll-ups themselves, but I'm still not fully convinced
>> that
>>>>> we
>>>>>>>>>> should just provide single-level metrics, because 1) I think for
>>>>>>>>> different
>>>>>>>>>> metrics people may be interested to investigate into different
>>>>>>>>>> granularities, e.g. for poll / commit rate these are at the
>> lowest
>>>>>>>>>> task-level metrics, while for process-rate / skip-rate they can
>> be
>>>>> as
>>>>>>>>> low
>>>>>>>>>> as processor-node metrics, and 2) user-side rolling ups may not
>> be
>>>>>> very
>>>>>>>>>> straight-forward. But for 2) if someone can provide an efficient
>>> and
>>>>>>>>> easy
>>>>>>>>>> implementation of that I can be persuaded :)
>>>>>>>>>>
>>>>>>>>>> For now I'm thinking we can add the metric on thread-level,
>> either
>>>>>> with
>>>>>>>>>> finer grained ones with "reason" tag plus an aggregated one
>> without
>>>>>> the
>>>>>>>>>> tag, or just having a single aggregated metric without the tag
>>> looks
>>>>>>>>> good
>>>>>>>>>> to me.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io
>>>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide
>> a
>>>>>>>>> list in
>>>>>>>>>>> the KIP. I was also thinking we'd better start a documentation
>>> page
>>>>>>>>> with
>>>>>>>>>>> the metrics listed.
>>>>>>>>>>>
>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On
>> the
>>>>>>>>> metric
>>>>>>>>>>> front, really I'm just pushing for us to maintain only one level
>>> of
>>>>>>>>>>> metrics. If that's more or less granular (i.e., maybe we don't
>>>>> have a
>>>>>>>>>>> metric per reason and log the reason instead), that's fine by
>> me.
>>> I
>>>>>>>>> just
>>>>>>>>>>> don't think it provides a lot of extra value per complexity
>>>>>> (interface
>>>>>>>>>> and
>>>>>>>>>>> implementation) to maintain roll-ups at the thread level in
>>>>> addition
>>>>>>>>> to
>>>>>>>>>>> lower-level metrics.
>>>>>>>>>>>
>>>>>>>>>>> How about this instead:
>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread,
>>>>> per-task,
>>>>>>>>> or
>>>>>>>>>>> per-processor-node) with no "reason"
>>>>>>>>>>> * introduce a warn-level log detailing the
>> topic/partition/offset
>>>>> and
>>>>>>>>>>> reason of the skipped record
>>>>>>>>>>>
>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <
>>> wangg...@gmail.com
>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>> coarser
>>>>>>>>>>> metrics
>>>>>>>>>>>> being easier than finer ones. All the metric alerting systems I
>>>>> have
>>>>>>>>>> used
>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
>> So
>>>>> my
>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>> thinking
>>>>>>>>> of an
>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>
>>>>>>>>>>>> For the commonly used JMX reporter tags will be encoded
>> directly
>>>>> as
>>>>>>>>>> part
>>>>>>>>>>> of
>>>>>>>>>>>> the object name, and if users wants to monitor them they need
>> to
>>>>>>>>> know
>>>>>>>>>>> these
>>>>>>>>>>>> values before hand. That is also why I think we do want to list
>>>>> all
>>>>>>>>> the
>>>>>>>>>>>> possible values of the reason tags in the KIP, since
>>>>>>>>>>>>
>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
>>>>>>>>> kind of
>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG on
>>>>> all
>>>>>>>>> the
>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>> once,
>>>>>>>>> that
>>>>>>>>>>> it
>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>> available
>>>>>>>>>>>> information about the reason and location of skips all the
>> time.
>>>>>>>>>>>>
>>>>>>>>>>>> That is a good point. I think we can either expose all levels
>>>>>>>>> metrics
>>>>>>>>>> as
>>>>>>>>>>> by
>>>>>>>>>>>> default, or only expose the most lower-level metrics and get
>> rid
>>>>> of
>>>>>>>>>> other
>>>>>>>>>>>> levels to let users do roll-ups themselves (which will be a
>> much
>>>>>>>>> larger
>>>>>>>>>>>> scope for discussion), or we can encourage users to not purely
>>>>>>>>> depend
>>>>>>>>>> on
>>>>>>>>>>>> metrics for such trouble shooting: that is to say, users only
>> be
>>>>>>>>>> alerted
>>>>>>>>>>>> based on metrics, and we can log a info / warn log4j entry each
>>>>>>>>> time we
>>>>>>>>>>> are
>>>>>>>>>>>> about to skip a record all over the places, so that upon being
>>>>>>>>> notified
>>>>>>>>>>>> users can look into the logs to find the details on where /
>> when
>>>>> it
>>>>>>>>>>>> happens. WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <
>> j...@confluent.io
>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag
>> values.
>>>>> I
>>>>>>>>>> can
>>>>>>>>>>>> list
>>>>>>>>>>>>> all possible values of the "reason" tag, but I'm thinking this
>>>>>>>>> level
>>>>>>>>>> of
>>>>>>>>>>>>> detail may not be KIP-worthy, maybe the code and documentation
>>>>>>>>> review
>>>>>>>>>>>> would
>>>>>>>>>>>>> be sufficient. If you all disagree and would like it included
>> in
>>>>>>>>> the
>>>>>>>>>>>> KIP, I
>>>>>>>>>>>>> can certainly do that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of
>>>>>>>>> keeping
>>>>>>>>>>> the
>>>>>>>>>>>>> same name but eliminating the tags for the dimensions that
>> were
>>>>>>>>>>>> rolled-up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.
>>>>>>>>>>>>> I'm not too sure that implementation efficiency really
>> becomes a
>>>>>>>>>> factor
>>>>>>>>>>>> in
>>>>>>>>>>>>> choosing whether to (by default) update one coarse metric at
>> the
>>>>>>>>>> thread
>>>>>>>>>>>>> level or one granular metric at the processor-node level,
>> since
>>>>>>>>> it's
>>>>>>>>>>> just
>>>>>>>>>>>>> one metric being updated either way. I do agree that if we
>> were
>>>>> to
>>>>>>>>>>> update
>>>>>>>>>>>>> the granular metrics and multiple roll-ups, then we should
>>>>>>>>> consider
>>>>>>>>>> the
>>>>>>>>>>>>> efficiency.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics for
>>>>> all
>>>>>>>>>>> nodes
>>>>>>>>>>>>> regardless of whether they can or do skip records. Perhaps we
>>> can
>>>>>>>>>>> lazily
>>>>>>>>>>>>> register the metrics.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the
>>>>>>>>> kind of
>>>>>>>>>>>>> scenario that would lead me as an operator to run with DEBUG
>> on
>>>>>>>>> all
>>>>>>>>>> the
>>>>>>>>>>>>> time, since I wouldn't be sure, having seen a skipped record
>>>>> once,
>>>>>>>>>> that
>>>>>>>>>>>> it
>>>>>>>>>>>>> would ever happen again. The solution is to capture all the
>>>>>>>>> available
>>>>>>>>>>>>> information about the reason and location of skips all the
>> time.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on
>> coarser
>>>>>>>>>>> metrics
>>>>>>>>>>>>> being easier than finer ones. All the metric alerting systems
>> I
>>>>>>>>> have
>>>>>>>>>>> used
>>>>>>>>>>>>> make it equally easy to alert on metrics by-tag or over tags.
>> So
>>>>>>>>> my
>>>>>>>>>>>>> experience doesn't say that this is a use case. Were you
>>> thinking
>>>>>>>>> of
>>>>>>>>>> an
>>>>>>>>>>>>> alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <
>>>>>>>>> wangg...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason"
>> tag?
>>>>>>>>> In
>>>>>>>>>> the
>>>>>>>>>>>>> JIRA
>>>>>>>>>>>>>> ticket I left some potential reasons but I'm not clear if
>>> you're
>>>>>>>>>>> going
>>>>>>>>>>>> to
>>>>>>>>>>>>>> categorize each of them as a separate reason, or is there any
>>>>>>>>>>>> additional
>>>>>>>>>>>>>> ones you have in mind.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also I'm wondering if we should add another metric that do
>> not
>>>>>>>>> have
>>>>>>>>>>> the
>>>>>>>>>>>>>> reason tag but aggregates among all possible reasons? This is
>>>>>>>>> for
>>>>>>>>>>> users
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> easily set their alerting notifications (otherwise they have
>> to
>>>>>>>>>> write
>>>>>>>>>>>> on
>>>>>>>>>>>>>> notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Note that the processor-node metrics is actually
>>> "per-thread,
>>>>>>>>>>>>> per-task,
>>>>>>>>>>>>>> per-processor-node", and today we only set the per-thread
>>>>>>>>> metrics
>>>>>>>>>> as
>>>>>>>>>>>> INFO
>>>>>>>>>>>>>> while leaving the lower two layers as DEBUG. I agree with
>> your
>>>>>>>>>>> argument
>>>>>>>>>>>>>> that we are missing the per-client roll-up metrics today, but
>>>>>>>>> I'm
>>>>>>>>>>>>> convinced
>>>>>>>>>>>>>> that the right way to approach it would be
>>>>>>>>>>> "just-providing-the-lowest-
>>>>>>>>>>>>>> level
>>>>>>>>>>>>>> metrics only".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note the recoding implementation of these three levels are
>>>>>>>>>> different
>>>>>>>>>>>>>> internally today: we did not just do the rolling up to
>> generate
>>>>>>>>> the
>>>>>>>>>>>>>> higher-level metrics from the lower level ones, but we just
>>>>>>>>> record
>>>>>>>>>>> them
>>>>>>>>>>>>>> separately, which means that, if we turn on multiple levels
>> of
>>>>>>>>>>> metrics,
>>>>>>>>>>>>> we
>>>>>>>>>>>>>> maybe duplicate collecting some metrics. One can argue that
>> is
>>>>>>>>> not
>>>>>>>>>>> the
>>>>>>>>>>>>> best
>>>>>>>>>>>>>> way to represent multi-level metrics collecting and
>> reporting,
>>>>>>>>> but
>>>>>>>>>> by
>>>>>>>>>>>>> only
>>>>>>>>>>>>>> enabling thread-level metrics as INFO today, that
>>> implementation
>>>>>>>>>>> could
>>>>>>>>>>>> be
>>>>>>>>>>>>>> more efficient than only collecting the metrics at the lowest
>>>>>>>>>> level,
>>>>>>>>>>>> and
>>>>>>>>>>>>>> then do the roll-up calculations outside of the metrics
>>> classes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Plus, today not all processor-nodes may possibly skip
>> records,
>>>>>>>>>> AFAIK
>>>>>>>>>>> we
>>>>>>>>>>>>>> will only skip records at the source, sink, window and
>>>>>>>>> aggregation
>>>>>>>>>>>>>> processor nodes, so adding a metric per processor looks like
>> an
>>>>>>>>>>>> overkill
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> me as well. On the other hand, from user's perspective the
>>>>>>>>> "reason"
>>>>>>>>>>> tag
>>>>>>>>>>>>> may
>>>>>>>>>>>>>> be sufficient for them to narrow down where inside the
>> topology
>>>>>>>>> is
>>>>>>>>>>>>> causing
>>>>>>>>>>>>>> records to be dropped on the floor. So I think the
>> "per-thread,
>>>>>>>>>>>> per-task"
>>>>>>>>>>>>>> level metrics should be sufficient for them in trouble shoot
>> in
>>>>>>>>>> DEBUG
>>>>>>>>>>>>> mode,
>>>>>>>>>>>>>> and we can add another "per-thread" level metrics as INFO
>> which
>>>>>>>>> is
>>>>>>>>>>>> turned
>>>>>>>>>>>>>> on by default. So under normal execution users still only
>> need
>>>>>>>>> INFO
>>>>>>>>>>>> level
>>>>>>>>>>>>>> metrics for alerting (e.g. set alerts on all skipped-records
>>>>>>>>>> metrics
>>>>>>>>>>> as
>>>>>>>>>>>>>> non-zero), and then upon trouble shooting they can turn on
>>> DEBUG
>>>>>>>>>>>> metrics
>>>>>>>>>>>>> to
>>>>>>>>>>>>>> look into which task is actually causing the skipped records.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <
>>>>>>>>>>>> matth...@confluent.io>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering
>> what
>>>>>>>>>>>> `reason`
>>>>>>>>>>>>>>> tags you want to introduce? Can you elaborate? The KIP
>> should
>>>>>>>>>> list
>>>>>>>>>>>>> those
>>>>>>>>>>>>>>> IMHO.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> About the fine grained metrics vs the roll-up: you say that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions
>>>>>>>>>> simultaneously
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced
>> atm
>>>>>>>>>> that
>>>>>>>>>>>> we
>>>>>>>>>>>>>>> should put the fine grained metrics into INFO level and
>> remove
>>>>>>>>>> the
>>>>>>>>>>>>>>> roll-up at thread level.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable
>>>>>>>>> top-level
>>>>>>>>>>> view
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion.
>>>>>>>>>> Offering
>>>>>>>>>>> a
>>>>>>>>>>>>>>> built-in `KafkaStreams` "client" roll-up out of the box
>> might
>>>>>>>>> be
>>>>>>>>>> a
>>>>>>>>>>>>>>> better solution. In the past we did not offer this due to
>>>>>>>>>>> performance
>>>>>>>>>>>>>>> concerns, but we could allow an "opt-in" mechanism. If you
>>>>>>>>>>> disagree,
>>>>>>>>>>>>> can
>>>>>>>>>>>>>>> you provide some reasoning and add them to the "Rejected
>>>>>>>>>>>> alternatives"
>>>>>>>>>>>>>>> section.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To rephrase: I understand the issue about missing top-level
>>>>>>>>> view,
>>>>>>>>>>> but
>>>>>>>>>>>>>>> instead of going more fine grained, we should consider to
>> add
>>>>>>>>>> this
>>>>>>>>>>>>>>> top-level view and add/keep the fine grained metrics at
>> DEBUG
>>>>>>>>>> level
>>>>>>>>>>>>> only
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove
>> old
>>>>>>>>>>> metrics
>>>>>>>>>>>>>>> directly as you suggested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <
>>>>>>>>>> j...@confluent.io
>>>>>>>>>>>>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around
>>>>>>>>> skipped
>>>>>>>>>>>> records
>>>>>>>>>>>>>> in
>>>>>>>>>>>>>>>>> Streams.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>>>>>>>>> 274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> 
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to