Thanks Guozhang, that was my intent. @John: yes, we should not nail down the exact log message. It's just to point out the trade-off. If we can get the required information in the logs, we might not need task-level metrics.

-Matthias
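For illustration, a minimal sketch of the kind of WARN entry under discussion, carrying enough context (reason, task, node, topic/partition/offset) that a reader could locate a skip without task-level metrics. The helper method, message wording, and field set are placeholders only, since the thread deliberately leaves the exact message to the PR review:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    class SkipLoggingSketch {
        private static final Logger log = LoggerFactory.getLogger(SkipLoggingSketch.class);

        // Hypothetical helper, called wherever Streams decides to drop a record.
        static void logSkip(String taskId, String nodeName, String reason,
                            ConsumerRecord<byte[], byte[]> raw) {
            // One WARN entry per skip, carrying the context that would otherwise
            // have to be recovered from task- or node-level metrics:
            log.warn("Skipping record. reason=[{}] task=[{}] node=[{}] topic=[{}] partition=[{}] offset=[{}]",
                    reason, taskId, nodeName, raw.topic(), raw.partition(), raw.offset());
        }
    }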
On 4/3/18 11:26 AM, Guozhang Wang wrote:
> I think Matthias' comment is that we can still record the metrics at the thread level, while having the WARN log entry include sufficient context information so that users can still easily narrow down the investigation scope.
>
> Guozhang
>
> On Tue, Apr 3, 2018 at 11:22 AM, John Roesler <j...@confluent.io> wrote:
>> I agree we should add as much information as is reasonable to the log. For example, see this WIP PR I started for this KIP:
>>
>> https://github.com/apache/kafka/pull/4812/files#diff-88d129f048bc842c7db5b2566a45fce8R80
>>
>> and
>>
>> https://github.com/apache/kafka/pull/4812/files#diff-69e6789eb675ec978a1abd24fed96eb1R111
>>
>> I'm not sure whether we should nail down the log messages in the KIP or in the PR discussion. What say you?
>>
>> Thanks,
>> -John
>>
>> On Tue, Apr 3, 2018 at 12:20 AM, Matthias J. Sax <matth...@confluent.io> wrote:
>>> Thanks for sharing your thoughts. As I mentioned originally, I am not sure about the right log level either. Your arguments are convincing -- thus, I am fine with keeping WARN level.
>>>
>>> The task- vs. thread-level argument is an interesting one. However, I am wondering if we should add this information to the corresponding WARN logs that we write anyway? For this case, we can also log the corresponding operator (and other information like topic name, etc., if needed). WDYT about this?
>>>
>>> -Matthias
>>>
>>> On 4/2/18 8:31 PM, Guozhang Wang wrote:
>>>> Regarding logging: I'm inclined to keep logging at WARN level, since skipped records are not expected in normal execution (for all the reasons that we are aware of), and hence when errors happen users should be alerted by the metrics and look into the log files -- so to me, if it is really spamming the log files, that is itself a good alert for users. Besides, for deserialization errors we already log at WARN level for this reason.
>>>>
>>>> Regarding the metrics levels: I was pondering on that as well. What made me think about and agree on task level rather than thread level is that for some reasons, like window retention, skips may be happening on only a subset of input partitions; since tasks are correlated with partitions, the task-level metrics can help users narrow down the specific input data partitions.
>>>>
>>>> Guozhang
>>>>
>>>> On Mon, Apr 2, 2018 at 6:43 PM, John Roesler <j...@confluent.io> wrote:
>>>>> Hi Matthias,
>>>>>
>>>>> No worries! Thanks for the reply.
>>>>>
>>>>> 1) There isn't a connection. I tried using the TopologyTestDriver to write a quick test exercising the current behavior and discovered that the metrics weren't available. It seemed like they should be, so I tacked it on to this KIP. If you feel it's inappropriate, I can pull it back out.
>>>>>
>>>>> 2) I was also concerned about that, but I figured it would come up in discussion if I just went ahead and proposed it. And here we are!
>>>>>
>>>>> Here's my thought: maybe there are two classes of skips, "controlled" and "uncontrolled", where "controlled" means, as an app author, I deliberately filter out some events, and "uncontrolled" means that I simply don't account for some feature of the data, and the framework skips them (as opposed to crashing).
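For illustration, a minimal (made-up) sketch of a "controlled" skip in this sense: the app author drops the records the downstream logic doesn't account for, explicitly in the topology, instead of leaving them to the framework. The topic names and the predicate are invented for the example:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;

    public class ControlledSkipExample {
        public static Topology build() {
            StreamsBuilder builder = new StreamsBuilder();
            // A "controlled" skip: deliberately filter out records the downstream
            // logic doesn't handle (here: null or non-numeric values), rather than
            // letting the framework drop them as "uncontrolled" skips.
            builder.<String, String>stream("orders")
                   .filter((key, value) -> value != null && value.matches("\\d+"))
                   .to("valid-orders");
            return builder.build();
        }
    }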
>>>>> In this breakdown, the skips I'm adding metrics for are all uncontrolled skips (and we hope to measure all the uncontrolled skips). Our skips are well documented, so it wouldn't be terrible to have an application in which you know you expect to have tons of uncontrolled skips, but it's not great either, since you may also have some *unexpected* uncontrolled skips. Those will be difficult to notice, since you're probably not alerting on the metric and you're filtering out the logs (whatever their level).
>>>>>
>>>>> I'd recommend any app author, as an alternative, to convert all expected skips to controlled ones, by updating the topology to filter those records out.
>>>>>
>>>>> Following from my recommendation, as a library author, I'm inclined to mark those logs WARN, since in my opinion they should be concerning to the app authors. I'd definitely want to show, rather than hide, them by default, so I would pick INFO at least.
>>>>>
>>>>> That said, logging is always a tricky issue for lower-level libraries that run inside user code, since we don't have all the information we need to make the right call.
>>>>>
>>>>> On your last note, yeah, I got that impression from Guozhang as well. Thanks for the clarification.
>>>>>
>>>>> -John
>>>>>
>>>>> On Mon, Apr 2, 2018 at 4:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>> John,
>>>>>>
>>>>>> sorry for my late reply, and thanks for updating the KIP.
>>>>>>
>>>>>> I like your approach of "metrics are for monitoring, logs are for debugging" -- however:
>>>>>>
>>>>>> 1) I don't see a connection between this and the task-level metrics that you propose in order to get the metrics into `TopologyTestDriver`. I don't think people would monitor the `TopologyTestDriver`, and thus I'm wondering why it is important to include the metrics there? Thread-level metrics might be easier to monitor, though (i.e., fewer different metrics to monitor).
>>>>>>
>>>>>> 2) I am a little worried about WARN level logging and that it might be too chatty -- as you pointed out, it's about debugging, thus DEBUG level might be better. Not 100% sure about this, to be honest. What is the general assumption about the frequency of skipped records? I could imagine cases in which skipped records are quite frequent and thus WARN level logs might "flood" the logs.
>>>>>>
>>>>>> One final remark:
>>>>>>
>>>>>>> More generally, I would like to establish a pattern in which we could add new values for the "reason" tags without needing a KIP to do so.
>>>>>>
>>>>>> From my understanding, this is not feasible. Changing metrics is always considered a public API change, and we need a KIP for any change. As we moved away from tagging, it doesn't matter for the KIP anymore -- just wanted to point it out.
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>> On 3/30/18 2:47 PM, John Roesler wrote:
>>>>>>> Alrighty! The KIP is updated.
>>>>>>>
>>>>>>> Thanks again, all, for the feedback.
>>>>>>> -John
>>>>>>>
>>>>>>> On Fri, Mar 30, 2018 at 3:35 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>> Hey Guozhang and Bill,
>>>>>>>>
>>>>>>>> Ok, I'll update the KIP. At the risk of disturbing consensus, I'd like to put it in the task instead of the thread so that it'll show up in the TopologyTestDriver metrics as well.
>>>>>>>>
>>>>>>>> I'm leaning toward keeping the scope where it is right now, but if others want to advocate for tossing in some more metrics, we can go that route.
>>>>>>>>
>>>>>>>> Thanks all,
>>>>>>>> -John
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 2:37 PM, Bill Bejeck <bbej...@gmail.com> wrote:
>>>>>>>>> Thanks for the KIP John, and sorry for the late comments.
>>>>>>>>>
>>>>>>>>> I'm on the fence about providing a single level of metrics, but I think we'll have that discussion outside of this KIP.
>>>>>>>>>
>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>
>>>>>>>>> I'm +1 on both of these suggestions.
>>>>>>>>>
>>>>>>>>> Finally, we have had requests in the past for some metrics around when a persistent store removes an expired window. Would adding that to our metrics stretch the scope of this KIP too much?
>>>>>>>>>
>>>>>>>>> Thanks again, and overall I'm +1 on this KIP.
>>>>>>>>>
>>>>>>>>> Bill
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 2:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>> The proposal sounds good to me. As for "maintain only one level of metrics", maybe we can discuss that separately from this KIP, since that would be a larger scope of discussion. I agree that if we are going to maintain only one level of metrics it should be the lowest level, and we would let users do the roll-ups themselves, but I'm still not fully convinced that we should provide only single-level metrics, because 1) I think for different metrics people may be interested in investigating at different granularities, e.g. for poll / commit rate these are at the lowest task-level metrics, while for process-rate / skip-rate they can be as low as processor-node metrics, and 2) user-side roll-ups may not be very straightforward. But for 2), if someone can provide an efficient and easy implementation of that, I can be persuaded :)
>>>>>>>>>>
>>>>>>>>>> For now I'm thinking we can add the metric at the thread level; either finer-grained ones with a "reason" tag plus an aggregated one without the tag, or just a single aggregated metric without the tag, looks good to me.
>>>>>>>>>>
>>>>>>>>>> Guozhang
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 8:05 AM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the reply. Regarding JMX, I can dig it. I'll provide a list in the KIP. I was also thinking we'd better start a documentation page with the metrics listed.
>>>>>>>>>>>
>>>>>>>>>>> I'd have no problem logging a warning when we skip records. On the metric front, really I'm just pushing for us to maintain only one level of metrics. If that's more or less granular (i.e., maybe we don't have a metric per reason and log the reason instead), that's fine by me. I just don't think it provides a lot of extra value per complexity (interface and implementation) to maintain roll-ups at the thread level in addition to lower-level metrics.
>>>>>>>>>>>
>>>>>>>>>>> How about this instead:
>>>>>>>>>>> * maintain one skipped-record metric (could be per-thread, per-task, or per-processor-node) with no "reason"
>>>>>>>>>>> * introduce a warn-level log detailing the topic/partition/offset and reason of the skipped record
>>>>>>>>>>>
>>>>>>>>>>> If you like that, I can update the KIP.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -John
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 29, 2018 at 6:22 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>
>>>>>>>>>>>> For the commonly used JMX reporter, tags will be encoded directly as part of the object name, and if users want to monitor them they need to know these values beforehand. That is also why I think we do want to list all the possible values of the reason tags in the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>
>>>>>>>>>>>> That is a good point. I think we can either expose all levels of metrics by default, or expose only the lowest-level metrics and get rid of the other levels, letting users do the roll-ups themselves (which would be a much larger scope of discussion), or we can encourage users not to depend purely on metrics for such troubleshooting: that is to say, users would be alerted based on metrics only, and we would log an info / warn log4j entry each time we are about to skip a record, in all the relevant places, so that upon being notified users can look into the logs to find the details on where / when it happens. WDYT?
>>>>>>>>>>>>
>>>>>>>>>>>> Guozhang
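For illustration of the JMX point above: with the commonly used JMX reporter, the tags land in the MBean ObjectName, so an alerting rule must enumerate (or wildcard) the tag values up front. A rough sketch of polling such a metric follows; the domain, type, and attribute names are placeholders, since the final metric names are exactly what this KIP is to settle:

    import java.lang.management.ManagementFactory;
    import javax.management.AttributeNotFoundException;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class SkippedRecordsJmxCheck {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // Tags are encoded into the ObjectName, so we match them with a wildcard:
            ObjectName pattern = new ObjectName("kafka.streams:type=stream-metrics,*");
            for (ObjectName name : server.queryNames(pattern, null)) {
                try {
                    // Placeholder attribute name for the proposed aggregate metric:
                    Object skipped = server.getAttribute(name, "skipped-records-total");
                    System.out.println(name + " -> " + skipped);
                } catch (AttributeNotFoundException e) {
                    // This MBean doesn't carry the skipped-records metric; ignore it.
                }
            }
        }
    }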
>>>>>>>>>>>> On Thu, Mar 29, 2018 at 3:57 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>> Hey Guozhang,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the review.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> Matthias raised the same question about the "reason" tag values. I can list all possible values of the "reason" tag, but I'm thinking this level of detail may not be KIP-worthy; maybe the code and documentation review would be sufficient. If you all disagree and would like it included in the KIP, I can certainly do that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we do provide roll-up metrics, I agree with the pattern of keeping the same name but eliminating the tags for the dimensions that were rolled up.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2.
>>>>>>>>>>>>> I'm not too sure that implementation efficiency really becomes a factor in choosing whether to (by default) update one coarse metric at the thread level or one granular metric at the processor-node level, since it's just one metric being updated either way. I do agree that if we were to update the granular metrics and multiple roll-ups, then we should consider the efficiency.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree it's probably not necessary to surface the metrics for all nodes regardless of whether they can or do skip records. Perhaps we can lazily register the metrics.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In my email in response to Matthias, I gave an example of the kind of scenario that would lead me as an operator to run with DEBUG on all the time, since I wouldn't be sure, having seen a skipped record once, that it would ever happen again. The solution is to capture all the available information about the reason and location of skips all the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One thing you mention is the notion of setting alerts on coarser metrics being easier than finer ones. All the metric alerting systems I have used make it equally easy to alert on metrics by-tag or over tags. So my experience doesn't say that this is a use case. Were you thinking of an alerting system that makes such a pre-aggregation valuable?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>> -John
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 5:24 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>>>>>>>>> Hello John,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the KIP. Some comments:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Could you list all the possible values of the "reason" tag? In the JIRA ticket I left some potential reasons, but I'm not clear whether you're going to categorize each of them as a separate reason, or whether there are additional ones you have in mind.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, I'm wondering if we should add another metric that does not have the reason tag but aggregates among all possible reasons? This is for users to easily set their alerting notifications (otherwise they have to write one notification rule per reason) in their monitoring systems.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Note that the processor-node metrics are actually "per-thread, per-task, per-processor-node", and today we only set the per-thread metrics as INFO while leaving the lower two layers as DEBUG. I agree with your argument that we are missing the per-client roll-up metrics today, but I'm not convinced that the right way to approach it would be "just providing the lowest-level metrics only".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Note that the recording implementation of these three levels is actually different internally today: we do not just roll up the lower-level metrics to generate the higher-level ones; we record them separately, which means that if we turn on multiple levels of metrics, we may duplicate collecting some metrics. One can argue that this is not the best way to implement multi-level metrics collecting and reporting, but by only enabling thread-level metrics as INFO today, that implementation could be more efficient than collecting the metrics only at the lowest level and then doing the roll-up calculations outside of the metrics classes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Plus, today not all processor nodes may possibly skip records; AFAIK we will only skip records at the source, sink, window, and aggregation processor nodes, so adding a metric per processor looks like overkill to me as well. On the other hand, from the user's perspective the "reason" tag may be sufficient for them to narrow down where inside the topology records are being dropped on the floor. So I think the "per-thread, per-task" level metrics should be sufficient for them to troubleshoot in DEBUG mode, and we can add another "per-thread" level metric as INFO which is turned on by default. So under normal execution users still only need INFO-level metrics for alerting (e.g. set alerts on all skipped-records metrics being non-zero), and then upon troubleshooting they can turn on DEBUG metrics to look into which task is actually causing the skipped records.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Guozhang
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 29, 2018 at 2:03 PM, Matthias J. Sax <matth...@confluent.io> wrote:
>>>>>>>>>>>>>>> Thanks for the KIP John.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Reading the material on the related Jiras, I am wondering what `reason` tags you want to introduce? Can you elaborate? The KIP should list those, IMHO.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> About the fine-grained metrics vs. the roll-up: you say that
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the coarse metric aggregates across two dimensions simultaneously
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you elaborate on why this is an issue? I am not convinced atm that we should put the fine-grained metrics into INFO level and remove the roll-up at thread level.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Given that they have to do this sum to get a usable top-level view
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is a fair concern, but I don't share the conclusion. Offering a built-in `KafkaStreams` "client" roll-up out of the box might be a better solution. In the past we did not offer this due to performance concerns, but we could allow an "opt-in" mechanism. If you disagree, can you provide some reasoning and add it to the "Rejected alternatives" section.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To rephrase: I understand the issue about the missing top-level view, but instead of going more fine-grained, we should consider adding this top-level view and adding/keeping the fine-grained metrics at DEBUG level only.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am +1 to add TopologyTestDriver#metrics() and to remove old metrics directly, as you suggested.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 3/28/18 6:42 PM, Ted Yu wrote:
>>>>>>>>>>>>>>>> Looks good to me.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 3:11 PM, John Roesler <j...@confluent.io> wrote:
>>>>>>>>>>>>>>>>> Hello all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am proposing KIP-274 to improve the metrics around skipped records in Streams.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please find the details here:
>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-274%3A+Kafka+Streams+Skipped+Records+Metrics
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Please let me know what you think!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> -John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> -- Guozhang
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> -- Guozhang
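For illustration of the TopologyTestDriver point discussed upthread, a rough sketch of what a test could do once the proposed TopologyTestDriver#metrics() accessor exists; the "skipped-records" name filter is a placeholder for whatever names the KIP finally adopts:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.TopologyTestDriver;

    public class SkippedRecordsMetricSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "skip-metrics-test");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "dummy:1234");

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input").to("output"); // trivial pass-through topology

            TopologyTestDriver driver = new TopologyTestDriver(builder.build(), props);
            try {
                // After piping in records that should be skipped, the proposed
                // metrics() accessor would expose the same metrics as a live app:
                driver.metrics().forEach((name, metric) -> {
                    if (name.name().startsWith("skipped-records")) { // placeholder name
                        System.out.println(name.tags() + " -> " + metric.metricValue());
                    }
                });
            } finally {
                driver.close();
            }
        }
    }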