Sounds great! The cryptic topic names can be an issue -- however, people can `describe()` their topology to map a name to the corresponding sub-topology/tasks and narrow an error down to the corresponding operators. I think this should be "sufficient for now" for debugging.
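For example, a quick sketch (the builder variable and the generated name in the comment are illustrative):

  import org.apache.kafka.streams.StreamsBuilder;
  import org.apache.kafka.streams.Topology;

  final StreamsBuilder builder = new StreamsBuilder();
  // ... define the topology ...
  final Topology topology = builder.build();

  // Prints each sub-topology with its processor nodes and the internal
  // topics they read and write, e.g. a repartition topic like
  // "KSTREAM-AGGREGATE-STATE-STORE-0000000003-repartition" -- which lets
  // you map a cryptic topic name from a log line back to the operator
  // that created it.
  System.out.println(topology.describe());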
Renaming those topics seems to be out-of-scope for this KIP.

-Matthias

On 4/3/18 2:45 PM, Guozhang Wang wrote:
> Thanks John, your proposal looks fine to me.
>
> I'll go ahead and look into the PR for more details myself.
>
> Guozhang
>
> On Tue, Apr 3, 2018 at 1:35 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>
>> Hi John,
>>
>> Thanks for making the updates.
>>
>> I agree with the information you've included in the logs as described above, as log statements without enough context/information can be frustrating.
>>
>> -Bill
>>
>> On Tue, Apr 3, 2018 at 3:29 PM, John Roesler <j...@confluent.io> wrote:
>>
>>> Allrighty, how about this, then...
>>>
>>> I'll move the metric back to the StreamThread and maintain the existing tag (client-id=...(per-thread client-id)). It won't be present in the TopologyTestDriver's metrics.
>>>
>>> As a side note, I'm not sure that the location of the log messages has visibility into the name of the thread or the task, or the processor node, for that matter. But at the end of the day, I don't think it really matters.
>>>
>>> None of those identifiers are in the public interface or user-controlled. For them to be useful for debugging, users would have to gain a very deep understanding of how their DSL program gets executed. From my perspective, they are all included in metric tags only to prevent collisions between the same metrics in different (e.g.) threads.
>>>
>>> I think what's important is to provide the right information in the logs so that users will be able to debug their issues. This is why the logs in my PR include the topic/partition/offset of the offending data, as well as the stacktrace of the exception from the deserializer (or for timestamps, the extracted timestamp and the class name of their extractor). This information alone should let them pinpoint the offending data and fix it.
>>>
>>> (I am aware that that topic name might be a repartition topic, and therefore also esoteric from the user's perspective, but I think it's the best we can do right now. It might be nice to explicitly take on a debugging-ergonomics task in the future and give all processor nodes human-friendly names. Then, we could surface these names in any logs or exceptions. But I'm inclined to call this out-of-scope for now.)
>>>
>>> Thanks again,
>>> -John
>>>
>>> On Tue, Apr 3, 2018 at 1:40 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>
>>>> 1. If we can indeed gather all the context information from the log4j entries, I'd suggest we change to thread-level (I'm not sure if that is doable, so it would help if John already has some WIP PR that can let us decide).
>>>>
>>>> 2. We can consider adding the API in TopologyTestDriver for general testing purposes; that being said, I think Matthias has a good point that this alone should not be a driving motivation for us to keep this metric as task-level if 1) is true.
>>>>
>>>> Guozhang
>>>>
>>>> On Tue, Apr 3, 2018 at 11:36 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>
>>>>> Thanks Guozhang, that was my intent.
>>>>>
>>>>> @John: yes, we should not nail down the exact log message. It's just to point out the trade-off. If we can get the required information in the logs, we might not need task-level metrics.
>>>>>
>>>>> -Matthias
>>>>>
>>>>> On 4/3/18 11:26 AM, Guozhang Wang wrote:
>>>>>> I think Matthias' comment is that we can still record the metrics on the thread-level, while having the WARN log entry include sufficient context information so that users can still easily narrow down the investigation scope.
>>>>>>
>>>>>> Guozhang
>>>>>>
>>>>>> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:
>>>>>>
>>>>>>> I agree we should add as much information as is reasonable to the log. For example, see this WIP PR I started for this KIP:
>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>>>>>>>
>>>>>>> and
>>>>>>>
>>>>>>> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>>>>>>>
>>>>>>> I'm not sure if we should nail down the log messages in the KIP or in the PR discussion. What say you?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -John
>>>>>>>
>>>>>>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Thanks for sharing your thoughts. As I mentioned originally, I am not sure about the right log level either. Your arguments are convincing -- thus, I am fine with keeping WARN level.
>>>>>>>>
>>>>>>>> The task vs. thread level argument is an interesting one. However, I am wondering if we should add this information into the corresponding WARN logs that we write anyway? For this case, we can also log the corresponding operator (and other information like topic name etc. if needed). WDYT about this?
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>>>>>>> Regarding logging: I'm inclined to keep logging at WARN level since skipped records are not expected in normal execution (for all reasons that we are aware of), and hence when an error happens users should be alerted by metrics and look into the log files; so to me if it is really spamming the log files that is also a good alert for users. Besides, for deserialization errors we already log at WARN level for this reason.
>>>>>>>>>
>>>>>>>>> Regarding the metrics levels: I was pondering on that as well. What made me think about and agree on task-level rather than thread-level is that, for some reasons like window retention, skips may possibly be happening on a subset of input partitions, and since tasks are correlated with partitions, the task-level metrics can help users narrow down on the specific input data partitions.
>>>>>>>>>
>>>>>>>>> Guozhang
>>>>>>>>>
>>>>>>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Matthias,
>>>>>>>>>>
>>>>>>>>>> No worries! Thanks for the reply.
>>>>>>>>>>
>>>>>>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to write a quick test exercising the current behavior and discovered that the metrics weren't available. It seemed like they should be, so I tacked it on to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>>>>>>
>>>>>>>>>> 2) I was also concerned about that, but I figured it would come up in discussion if I just went ahead and proposed it. And here we are!
>>>>>>>>>>
>>>>>>>>>> Here's my thought: maybe there are two classes of skips: "controlled" and "uncontrolled", where "controlled" means, as an app author, I deliberately filter out some events, and "uncontrolled" means that I simply don't account for some feature of the data, and the framework skips them (as opposed to crashing).
>>>>>>>>>>
>>>>>>>>>> In this breakdown, the skips I'm adding metrics for are all uncontrolled skips (and we hope to measure all the uncontrolled skips). Our skips are well documented, so it wouldn't be terrible to have an application in which you know you expect to have tons of uncontrolled skips, but it's not great either, since you may also have some *unexpected* uncontrolled skips. It'll be difficult to notice, since you're probably not alerting on the metric and filtering out the logs (whatever their level).
>>>>>>>>>>
>>>>>>>>>> I'd recommend any app author, as an alternative, to convert all expected skips to controlled ones, by updating the topology to filter those records out.
>>>>>>>>>>
>>>>>>>>>> Following from my recommendation, as a library author, I'm inclined to mark those logs WARN, since in my opinion they should be concerning to the app authors. I'd definitely want to show, rather than hide, them by default, so I would pick INFO at least.
>>>>>>>>>>
>>>>>>>>>> That said, logging is always a tricky issue for lower-level libraries that run inside user code, since we don't have all the information we need to make the right call.
>>>>>>>>>>
>>>>>>>>>> On your last note, yeah, I got that impression from Guozhang as well. Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> -John
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> sorry for my late reply and thanks for updating the KIP.
>>>>>>>>>>>
>>>>>>>>>>> I like your approach about "metrics are for monitoring, logs are for debugging" -- however:
>>>>>>>>>>>
>>>>>>>>>>> 1) I don't see a connection between this and the task-level metrics that you propose to get the metrics into `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver` and thus I'm wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor though (ie, fewer different metrics to monitor).
>>>>>>>>>>>
>>>>>>>>>>> 2) I am a little worried about WARN level logging and that it might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases for which skipped records are quite frequent and thus, WARN level logs might "flood" the logs.
>>>>>>>>>>>
>>>>>>>>>>> One final remark:
>>>>>>>>>>>
>>>>>>>>>>>> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.
>>>>>>>>>>>
>>>>>>>>>>> From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.
>>>>>>>>>>>
>>>>>>>>>>> -Matthias
>>>>>>>>>>>
>>>>>>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>>>>>>> Allrighty! The KIP is updated.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks again, all, for the feedback.
>>>>>>>>>>>> -John
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks all,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm on the fence with providing single-level metrics, but I think we'll have that discussion outside of this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks again, and overall I'm +1 on this KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Bill
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The proposal sounds good to me. About "maintain only one level of metrics", maybe we can discuss that separately from this KIP since that would be a larger scope of discussion. I agree that if we are going to maintain only one-level metrics it should be the lowest level and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should just provide single-level metrics, because 1) for different metrics people may be interested to investigate into different granularities, e.g. for poll / commit rate these are at the lowest task-level metrics, while for process-rate / skip-rate they can be as low as processor-node metrics, and 2) user-side rolling up may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that I can be persuaded :)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For now I'm thinking we can add the metric on thread-level, either with finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just having a single aggregated metric without the tag looks good to me.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> How about this instead:
>>>>>>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand. That is also why I think we do want to list all the possible values of the reason tags in the KIP, since
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> That is a good point. I think we can either expose all levels of metrics by default, or only expose the lowest-level metrics and get rid of the other levels to let users do roll-ups themselves (which will be a much larger scope for discussion), or we can encourage users to not purely depend on metrics for such troubleshooting: that is to say, users would only be alerted based on metrics, and we can log an info / warn log4j entry each time we are about to skip a record, all over the places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 1.
>>>>>>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> 2.
>>>>>>>>>>>>>>>>>> I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear if you're going to categorize each of them as a separate reason, or if there are any additional ones you have in mind.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Note the recording implementation of these three levels is different internally today: we did not just do the rolling up to generate the higher-level metrics from the lower-level ones; we record them separately, which means that, if we turn on multiple levels of metrics, we may duplicate collecting some metrics. One can argue that is not the best way to represent multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than only collecting the metrics at the lowest level and then doing the roll-up calculations outside of the metrics classes.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for them to troubleshoot in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO level metrics for alerting (e.g. set alerts on all skipped-records metrics as non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those IMHO.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> About the fine-grained metrics vs. the roll-up: you say that
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions simultaneously
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Can you elaborate why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at thread level.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable top-level view
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider to add this top-level view and add/keep the fine-grained metrics at DEBUG level only.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly as you suggested.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around skipped records in Streams.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> -- Guozhang
>>>>
>>>> --
>>>> -- Guozhang
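P.S.: To make the WARN-log proposal from the thread above concrete, the entry John describes would look roughly like this. This is an illustrative sketch only; the logger name, variable names, and message wording are my assumptions, not the literal text from the PR:

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;

  final Logger log = LoggerFactory.getLogger("skipped-records-example");

  // On a deserialization failure that the handler decided to "log and
  // continue" on. rawRecord and deserializationException are assumed to
  // be in scope at the skip site.
  log.warn(
      "Skipping record due to deserialization error. topic=[{}] partition=[{}] offset=[{}]",
      rawRecord.topic(), rawRecord.partition(), rawRecord.offset(),
      deserializationException);

Since the exception is passed as the last argument, SLF4J appends the stacktrace, so the log entry alone carries the topic/partition/offset and the reason needed to pinpoint the offending data.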
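On the monitoring side, a rough sketch of how an operator could read the proposed thread-level skipped-record metric from a running application. The name "skipped-records-total" is an assumption based on this discussion; the KIP page is authoritative:

  // 'streams' is a running org.apache.kafka.streams.KafkaStreams instance.
  streams.metrics().forEach((metricName, metric) -> {
      if ("skipped-records-total".equals(metricName.name())) {
          // One data point per stream thread, tagged with the per-thread client-id.
          System.out.println(metricName.tags() + " -> " + metric.metricValue());
      }
  });

Note that, as John pointed out above, with the metric recorded on the StreamThread, the same lookup against TopologyTestDriver#metrics() would not find it, since the test driver runs no stream threads.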