I am fine adding `TopologyTestDriver#metrics()`. It should be helpful for testing custom metrics that users implement.
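[Editor's sketch: if such a getter is added, a user test could assert on the skipped-record metrics it exposes. The code below shows only the shape of that check; a plain `Map<String, Double>` stands in for whatever `TopologyTestDriver#metrics()` would actually return, and the metric names are hypothetical, not the ones fixed by the KIP.]

```java
import java.util.HashMap;
import java.util.Map;

public class SkippedRecordsCheck {
    // Sum every metric whose name mentions skipped records. A plain
    // Map<String, Double> stands in for the metrics map a
    // TopologyTestDriver#metrics() getter would presumably expose.
    static double totalSkipped(Map<String, Double> metrics) {
        return metrics.entrySet().stream()
                .filter(e -> e.getKey().contains("skipped-records"))
                .mapToDouble(Map.Entry::getValue)
                .sum();
    }

    public static void main(String[] args) {
        Map<String, Double> metrics = new HashMap<>();
        // Hypothetical per-task metric names, for illustration only.
        metrics.put("task.0_0.skipped-records-total", 2.0);
        metrics.put("task.0_1.skipped-records-total", 1.0);
        metrics.put("task.0_0.process-total", 100.0);
        System.out.println(totalSkipped(metrics)); // prints 3.0
    }
}
```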
-Matthias

On 4/3/18 11:35 AM, John Roesler wrote:
> Oh, sorry, I missed the point.
>
> Yeah, we can totally do that. The reason to move it to the task level was mainly to make it available for the metrics in TopologyTestDriver as well. But if we decide that's a non-goal, then there's no motivation to change it.
>
> And actually that reminds me that we do have an open question about whether I should add a metrics getter to the TopologyTestDriver's interface. WDYT?
>
> Thanks,
> -John
>
> On Tue, Apr 3, 2018 at 1:26 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> I think Matthias' comment is that we can still record the metrics at the thread level, while having the WARN log entry include sufficient context information so that users can still easily narrow down the investigation scope.
>>
>> Guozhang
>>
>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:
>>
>>> I agree we should add as much information as is reasonable to the log. For example, see this WIP PR I started for this KIP:
>>>
>>> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>>>
>>> and
>>>
>>> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>>>
>>> I'm not sure if we should nail down the log messages in the KIP or in the PR discussion. What say you?
>>>
>>> Thanks,
>>> -John
>>>
>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>
>>>> Thanks for sharing your thoughts. As I mentioned originally, I am not sure about the right log level either. Your arguments are convincing -- thus, I am fine with keeping WARN level.
>>>>
>>>> The task vs. thread level argument is an interesting one. However, I am wondering if we should add this information to the corresponding WARN logs that we write anyway? For this case, we can also log the corresponding operator (and other information like the topic name, etc., if needed). WDYT about this?
>>>>
>>>> -Matthias
>>>>
>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>> Regarding logging: I'm inclined to keep logging at WARN level, since skipped records are not expected in normal execution (for all reasons that we are aware of), and hence when errors happen users should be alerted by the metrics and look into the log files; so to me, if it really spams the log files, that is itself a good alert for users. Besides, for deserialization errors we already log at WARN level for this reason.
>>>>>
>>>>> Regarding the metrics levels: I was pondering that as well. What made me think about and agree on task level rather than thread level is that, for some reasons like window retention, skips may be happening on only a subset of input partitions, and since tasks are correlated with partitions, the task-level metrics can help users narrow down the specific input data partitions.
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:
>>>>>
>>>>>> Hi Matthias,
>>>>>>
>>>>>> No worries! Thanks for the reply.
>>>>>>
>>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to write a quick test exercising the current behavior and discovered that the metrics weren't available. It seemed like they should be, so I tacked it on to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>>
>>>>>> 2) I was also concerned about that, but I figured it would come up in discussion if I just went ahead and proposed it. And here we are!
>>>>>>
>>>>>> Here's my thought: maybe there are two classes of skips, "controlled" and "uncontrolled", where "controlled" means, as an app author, I deliberately filter out some events, and "uncontrolled" means that I simply don't account for some feature of the data, and the framework skips them (as opposed to crashing).
>>>>>>
>>>>>> In this breakdown, the skips I'm adding metrics for are all uncontrolled skips (and we hope to measure all the uncontrolled skips). Our skips are well documented, so it wouldn't be terrible to have an application in which you know you expect to have tons of uncontrolled skips, but it's not great either, since you may also have some *unexpected* uncontrolled skips. It'll be difficult to notice, since you're probably not alerting on the metric and are filtering out the logs (whatever their level).
>>>>>>
>>>>>> I'd recommend any app author, as an alternative, to convert all expected skips to controlled ones by updating the topology to filter those records out.
>>>>>>
>>>>>> Following from my recommendation, as a library author I'm inclined to mark those logs WARN, since in my opinion they should be concerning to the app authors. I'd definitely want to show, rather than hide, them by default, so I would pick INFO at least.
>>>>>>
>>>>>> That said, logging is always a tricky issue for lower-level libraries that run inside user code, since we don't have all the information we need to make the right call.
>>>>>>
>>>>>> On your last note, yeah, I got that impression from Guozhang as well. Thanks for the clarification.
>>>>>>
>>>>>> -John
>>>>>>
>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> sorry for my late reply, and thanks for updating the KIP.
>>>>>>>
>>>>>>> I like your approach of "metrics are for monitoring, logs are for debugging" -- however:
>>>>>>>
>>>>>>> 1) I don't see a connection between this and the task-level metrics that you propose to get the metrics into `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver`, and thus I am wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor though (ie, fewer different metrics to monitor).
>>>>>>>
>>>>>>> 2) I am a little worried about WARN level logging and that it might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this, to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases in which skipped records are quite frequent and thus WARN level logs might "flood" the logs.
>>>>>>>
>>>>>>> One final remark:
>>>>>>>
>>>>>>>> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.
>>>>>>>
>>>>>>> From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.
>>>>>>>
>>>>>>> -Matthias
>>>>>>>
>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>
>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>
>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>
>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.
>>>>>>>>>
>>>>>>>>> I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.
>>>>>>>>>
>>>>>>>>> Thanks all,
>>>>>>>>> -John
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>
>>>>>>>>>> I'm on the fence about providing a single level of metrics, but I think we'll have that discussion outside of this KIP.
>>>>>>>>>>
>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>
>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>
>>>>>>>>>> Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?
>>>>>>>>>>
>>>>>>>>>> Thanks again, and overall I'm +1 on this KIP.
>>>>>>>>>>
>>>>>>>>>> Bill
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The proposal sounds good to me. About "maintain only one level of metrics", maybe we can discuss that separately from this KIP, since that would be a larger scope of discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level, and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should provide just single-level metrics, because 1) I think for different metrics people may be interested in investigating different granularities, e.g. for poll / commit rate these are at the lowest task-level metrics, while process-rate / skip-rate can be as low as processor-node metrics, and 2) user-side rolling up may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)
>>>>>>>>>>>
>>>>>>>>>>> For now I'm thinking we can add the metric at the thread level; either finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag, looks good to me.
>>>>>>>>>>>
>>>>>>>>>>> Guozhang
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.
>>>>>>>>>>>>
>>>>>>>>>>>> I'd have no problem logging a warning when we skip records.
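[Editor's sketch, on Guozhang's point above that user-side roll-ups may not be straightforward: dropping a tag dimension and summing is the essence of it. The key format "thread-id|task-id" below is invented purely for illustration; it is not how Streams encodes metric tags.]

```java
import java.util.HashMap;
import java.util.Map;

public class MetricRollup {
    // Roll task-level counters up to thread level by dropping the task-id
    // dimension: each key is "thread-id|task-id" and we sum over task-id.
    // A sketch of the user-side roll-up discussed in the thread, not
    // Streams' internal recording logic.
    static Map<String, Double> rollUpToThread(Map<String, Double> perTask) {
        Map<String, Double> perThread = new HashMap<>();
        for (Map.Entry<String, Double> e : perTask.entrySet()) {
            String threadId = e.getKey().split("\\|")[0]; // keep only the thread tag
            perThread.merge(threadId, e.getValue(), Double::sum);
        }
        return perThread;
    }

    public static void main(String[] args) {
        Map<String, Double> perTask = new HashMap<>();
        perTask.put("thread-1|task-0_0", 2.0);
        perTask.put("thread-1|task-0_1", 3.0);
        perTask.put("thread-2|task-1_0", 1.0);
        // thread-1 -> 5.0, thread-2 -> 1.0
        System.out.println(rollUpToThread(perTask));
    }
}
```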
>>>>>>>>>>>> On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.
>>>>>>>>>>>>
>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>>>
>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand.
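[Editor's sketch of the point just made: with a JMX reporter, metric tags become key properties of the `ObjectName`, so a user must know the keys (and tag values) up front to build a query pattern. The domain and key names below are illustrative, not necessarily the exact ones Kafka's JmxReporter registers.]

```java
import javax.management.ObjectName;

public class JmxTagExample {
    public static void main(String[] args) throws Exception {
        // Tags encoded as ObjectName key properties (illustrative names).
        ObjectName name = new ObjectName(
                "kafka.streams:type=stream-task-metrics,thread-id=t-1,task-id=0_0");
        System.out.println(name.getKeyProperty("task-id")); // prints 0_0

        // To match all tasks, a user queries with a wildcard pattern, which
        // still requires knowing the domain, type, and tag keys in advance.
        ObjectName pattern =
                new ObjectName("kafka.streams:type=stream-task-metrics,*");
        System.out.println(pattern.apply(name)); // prints true
    }
}
```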
>>>>>>>>>>>>> That is also why I think we do want to list all the possible values of the reason tags in the KIP, since:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That is a good point. I think we can either expose all levels of metrics by default, or expose only the lowest-level metrics and get rid of the other levels to let users do the roll-ups themselves (which will be a much larger scope for discussion), or we can encourage users to not depend purely on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we would log an info / warn log4j entry each time we are about to skip a record, all over the places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear if you're going to categorize each of them as a separate reason, or whether there are any additional ones you have in mind.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>> Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Note that the recording implementation of these three levels is different internally today: we did not just roll up the lower-level metrics to generate the higher-level ones; we record them separately, which means that if we turn on multiple levels of metrics, we may duplicate collecting some metrics. One can argue that this is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than collecting the metrics only at the lowest level and then doing the roll-up calculations outside of the metrics classes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for them to troubleshoot in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO level metrics for alerting (e.g. set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those IMHO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> About the fine-grained metrics vs. the roll-up: you say that
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions simultaneously
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at the thread level.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable top-level view
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around skipped records in Streams.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> -- Guozhang
>>
>> --
>> -- Guozhang