Thanks for the crash course in statistical terms :) In light of this I definitely support Cumulative{Sum,Count}, but I'm really not crazy about SimpleWindowed{Sum,Count} (vs just Windowed). Not so much because of its unfortunate length (although that is unfortunate it shouldn't be a deciding factor) but because it seems to have the potential to confuse further. I'm not sure what we gain by adding "Simple" since to me at least, the unweighted-ness is obvious and the definition of simple is not. To those who haven't been exposed to the finer details of statistical definitions, I think they are more likely to read "SimpleXX" and wonder "is there an 'advanced' or non-simple kind of Windowed?" than they are to wonder what is the weighting behind these metrics.
Sophie On Wed, Jul 17, 2019 at 8:18 AM John Roesler <j...@confluent.io> wrote: > Thanks for the discussion, all. > > I've done a little more research into the statistical terminology. > Matthias is correct, "running" and "moving" appear to be synonyms. > Unfortunately, both can be computed either over a window of the last N > measurements or over all prior measurements. "Moving" just signifies > that the statistic is computed over a "live" data set, i.e., a > continuous stream of measurements, and the expectation is that the > stat would be updated in response to new measurements. > > I found https://en.wikipedia.org/wiki/Moving_average to have a pretty > good overview of the whole picture. > > After considering the discussion so far and some light reading, it > seems like "Cumulative" is truly the correct term for the all-time > metrics: > > > In a cumulative moving average, the data arrive in an > > ordered datum stream, and the user would like to get > > the average of all of the data up until the current datum > > point. > > I know that we previously felt that "cumulative" was too much of a > mouthful, but it seems like our quest for a terser term led us into a > briar patch. Also, now there is an independent source (the wiki page) > indicating that this is indeed the correct term, and it doesn't offer > any synonyms to choose from. Maybe we can take comfort in the fact > that we'll rarely be saying the name of the classes out loud. > > As far as moving stats that operate over a window of the last N > measurements, there are multiple options, including Simple > (unweighted), Weighted, and Exponentially Weighted, and presumably > infinite variations with other weighting functions. In our domain, > there is only one weighting function available, but it's still more > self-documenting and future-proof to specify the type of windowed > statistic. Therefore, I'm proposing "Simple" as the term for the > windowed (aka sampled) stats, while keeping Windowed in the name to > distinguish it from the all-time metrics. > > > In financial applications a simple moving average (SMA) > > is the unweighted mean of the previous n data. > > Therefore, we would have the proposed matrix: > > SimpleWindowedSum, SimpleWindowedCount > CumulativeSum, CumulativeCount > > Again, all these proposed names are less pithy than we might wish, but > the whole point of this exercise is to demystify and disambiguate > them. It seems like the discussion so far illustrates the futility of > trying to find names that are both short and descriptive. > > How does that sound? > -John > > On Tue, Jul 16, 2019 at 4:43 PM Matthias J. Sax <matth...@confluent.io> > wrote: > > > > It's a fair point that Ryanne raises. However, "running sum" is the same > > as "moving sum" from my understanding. > > > > The issue is still, that `Sum` and `Count` which seem to be the cleanest > > names cannot be used. While I agree that `TotalSum` and `TotalCount` is > > somewhat redundant, I still think it the best suggestion so far. > > > > For the "sampled" version, I am personally fine with either `MovingXxx`, > > `WindowedXxx`, or `RunningXxx` -- to me, that are all equally good to > > describe the semantics. > > > > > > > > -Matthias > > > > > > > > On 7/16/19 2:25 PM, Sophie Blee-Goldman wrote: > > > I'm +1 on Windowed, was about to suggest that as I was catching up on > the > > > discussion but Bill beat me to it :) > > > > > > On Tue, Jul 16, 2019 at 2:23 PM Bill Bejeck <bbej...@gmail.com> wrote: > > > > > >> Hi John, > > >> > > >> Thanks for the updates. > > >> > > >> I like RunningCount and RunningSum. > > >> > > >> What about WindowedCount, WindowedSum instead of Moving? > > >> I'm just throwing this out there as Windowed seems more intuitive to > me, > > >> but I'm not married to the idea. > > >> > > >> -Bill > > >> > > >> On Tue, Jul 16, 2019 at 5:09 PM John Roesler <j...@confluent.io> > wrote: > > >> > > >>> No worries! Choosing good public API names is a high-impact design > > >>> activity. > > >>> > > >>> Matthias, Bruno, Bill, and Stanislav, > > >>> > > >>> You've all contributed to this discussion or the vote so far... How > do > > >>> you feel about the proposed name change: > > >>> > > >>> MovingCount, MovingSum (instead of Sampled) > > >>> RunningCount, RunningSum (Instead of Total) > > >>> > > >>> Thanks, > > >>> -John > > >>> > > >>> On Tue, Jul 16, 2019 at 3:04 PM Ryanne Dolan <ryannedo...@gmail.com> > > >>> wrote: > > >>>> > > >>>> John, that makes sense to me. Sorry for the bikeshedding. > > >>>> > > >>>> Ryanne > > >>>> > > >>>> On Tue, Jul 16, 2019 at 12:49 PM John Roesler <j...@confluent.io> > > >> wrote: > > >>>> > > >>>>> Thanks for the explanation and the suggestion, Ryanne, > > >>>>> > > >>>>> I went with "sampled" just because these are instances of > > >> SampledStat, > > >>>>> which in the Kafka Metrics ecosystem are computed from a window of > > >>>>> recent samples. Thinking more about it, the fact that they are > > >> sampled > > >>>>> and the fact that they are windowed are orthogonal, which is what > > >>>>> you're pointing out... sampling by itself doesn't indicate that > it's > > >> a > > >>>>> moving average. > > >>>>> > > >>>>> Since there is no way in Kafka Metrics for a metric to be sampled > and > > >>>>> not windowed/moving/decaying, calling them Sampled would never be > > >>>>> incorrect. But to someone unfamiliar with the code, it wouldn't > > >>>>> immediately suggest the behavior of the metric that actually > matters. > > >>>>> That is, the behavior that distinguishes the two classes of metrics > > >> we > > >>>>> want to disambiguate here. > > >>>>> > > >>>>> It sounds like you'd suggest a new matrix of names: > > >>>>> MovingCount, MovingSum > > >>>>> RunningCount, RunningSum > > >>>>> > > >>>>> Are these names unambiguous and self explanatory? > > >>>>> > > >>>>> Thanks, > > >>>>> -John > > >>>>> > > >>>>> On Tue, Jul 16, 2019 at 12:32 PM Ryanne Dolan < > ryannedo...@gmail.com > > >>> > > >>>>> wrote: > > >>>>>> > > >>>>>>> measurements, which decay/expire over time > > >>>>>> > > >>>>>> Thanks John for the clarification. This was my (re-)reading of the > > >>> code, > > >>>>>> but this is not what I think of when I hear "sampled". In fact, > > >>> you'll > > >>>>>> notice that the Wikipedia pages for "Sample (statistics)" and > > >> "Sample > > >>>>>> (signal processing)" do not contain the words decay, expire, > > >> recent, > > >>>>>> history, or anything similar. > > >>>>>> > > >>>>>> Similar to "running", I'd suggest the more correct "moving", as in > > >>>>> "moving > > >>>>>> average" and "moving sum", which involve looking back N samples, > > >>>>> applying a > > >>>>>> sliding window, decaying over time etc. > > >>>>>> > > >>>>>> Ryanne > > >>>>>> > > >>>>>> On Tue, Jul 16, 2019, 11:58 AM John Roesler <j...@confluent.io> > > >>> wrote: > > >>>>>> > > >>>>>>> Thanks for raising this concern, Ryanne, > > >>>>>>> > > >>>>>>> "Sampled" indicates that the metrics is sampled, namely that we > > >>>>>>> maintain a set of samples from recent value measurements, which > > >>>>>>> decay/expire over time. So, the metric value is only > > >>> representative of > > >>>>>>> the recent past. > > >>>>>>> > > >>>>>>> "Total" indicates that the metric value contains all the > > >>> information > > >>>>>>> from the creation of the metric. For example., the total sum > > >> would > > >>>>>>> include all measurements since the app started up. > > >>>>>>> > > >>>>>>> It seems like your concern is that the word "total" doesn't > > >> really > > >>>>>>> pinpoint this meaning, which is true. It's especially confusing > > >>> that > > >>>>>>> another meaning of "total" is synonymous with "sum", rendering > > >> the > > >>>>>>> name "TotalSum" sort of absurd. > > >>>>>>> > > >>>>>>> We previously considered "cumulative", which was rejected as a > > >>>>>>> mouthful (it's four syllables) . > > >>>>>>> > > >>>>>>> You mentioned "running", which might be a more appropriate > > >> modifier > > >>>>>>> (RunningSum and RunningCount). > > >>>>>>> > > >>>>>>> What would everyone think about that? > > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> -John > > >>>>>>> > > >>>>>>> On Tue, Jul 16, 2019 at 11:27 AM Ryanne Dolan < > > >>> ryannedo...@gmail.com> > > >>>>>>> wrote: > > >>>>>>>> > > >>>>>>>> John, I mentioned on the VOTE thread that the proposed names > > >> are > > >>> a > > >>>>> bit > > >>>>>>>> confusing, > > >>>>>>>> > > >>>>>>>>> given that "sum", "total", and "count" are roughly > > >>> synonymous... > > >>>>>>>> > > >>>>>>>> In particular, TotalSum is, I think, a "running total", though > > >>> the > > >>>>> naming > > >>>>>>>> doesn't really capture that. > > >>>>>>>> > > >>>>>>>> I think to avoid confusion, we should define exactly what > > >>> "total" and > > >>>>>>>> "sampled" are supposed to indicate, and perhaps pick > > >> appropriate > > >>>>> naming > > >>>>>>>> from there. > > >>>>>>>> > > >>>>>>>> Ryanne > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On Fri, Jul 12, 2019 at 1:42 PM John Roesler < > > >> j...@confluent.io> > > >>>>> wrote: > > >>>>>>>> > > >>>>>>>>> Hey, thanks Matthias and Bruno, > > >>>>>>>>> > > >>>>>>>>> I agree, "Cumulative" is a mouthful. "TotalX" sounds fine to > > >>> me. > > >>>>>>>>> > > >>>>>>>>> Also, yes, I would have liked to not have any modifier for > > >>>>>>>>> "non-sampled", but there is a name conflict with Sum. > > >>>>>>>>> > > >>>>>>>>> I'll update the KIP to reflect "TotalX" and then start the > > >> vote > > >>>>> thread. > > >>>>>>>>> > > >>>>>>>>> Thanks again, > > >>>>>>>>> -John > > >>>>>>>>> > > >>>>>>>>> On Fri, Jul 12, 2019 at 11:27 AM Bruno Cadonna < > > >>> br...@confluent.io > > >>>>>> > > >>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>> OK, makes sense. Then, I am in favour of TotalCount and > > >>> TotalSum. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Bruno > > >>>>>>>>>> > > >>>>>>>>>> On Fri, Jul 12, 2019 at 12:57 AM Matthias J. Sax < > > >>>>>>> matth...@confluent.io> > > >>>>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> `Sum` is an existing name, for the "sampled sum" metric, > > >>> that > > >>>>> gets > > >>>>>>>>>>> deprecated. Hence, we cannot use it. > > >>>>>>>>>>> > > >>>>>>>>>>> If we cannot use `Sum` and use `TotalSum`, we should also > > >>> not > > >>>>> use > > >>>>>>>>>>> `Count` but `TotalCount` for consistency. > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> -Matthias > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>>>> On 7/11/19 12:58 PM, Bruno Cadonna wrote: > > >>>>>>>>>>>> Hi John, > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thank you for the KIP. > > >>>>>>>>>>>> > > >>>>>>>>>>>> LGTM > > >>>>>>>>>>>> > > >>>>>>>>>>>> I also do not like CumulativeSum/Count so much. I > > >>> propose to > > >>>>> just > > >>>>>>>>> call > > >>>>>>>>>>>> it Sum and Count. > > >>>>>>>>>>>> > > >>>>>>>>>>>> I understand that you want to unequivocally distinguish > > >>> the > > >>>>> two > > >>>>>>>>> metric > > >>>>>>>>>>>> functions by their names, but I have the feeling the > > >>> names > > >>>>> become > > >>>>>>>>>>>> artificially complex. The exact semantics can also be > > >>>>> documented > > >>>>>>> in > > >>>>>>>>>>>> the javadocs, which btw could also be improved in those > > >>>>> classes. > > >>>>>>>>>>>> > > >>>>>>>>>>>> Best, > > >>>>>>>>>>>> Bruno > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Thu, Jul 11, 2019 at 8:25 PM Matthias J. Sax < > > >>>>>>>>> matth...@confluent.io> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Thanks for the KIP. Overall LGTM. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> The only though I have is, if we may want to use > > >>> `TotalSum` > > >>>>> and > > >>>>>>>>>>>>> `TotalCount` instead of `CumulativeSum/Count` as > > >> names? > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> -Matthias > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On 7/11/19 9:31 AM, John Roesler wrote: > > >>>>>>>>>>>>>> Hi Kafka devs, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> I'd like to propose KIP-488 as a minor cleanup of > > >> some > > >>> of > > >>>>> our > > >>>>>>>>> metric > > >>>>>>>>>>>>>> implementations. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> KIP-488: > > >> https://cwiki.apache.org/confluence/x/kkAyBw > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Over time, iterative updates to these metrics has > > >>> resulted > > >>>>> in a > > >>>>>>>>> pretty > > >>>>>>>>>>>>>> confusing little collection of classes, and I've > > >>> personally > > >>>>>>> been > > >>>>>>>>>>>>>> involved in three separate moderately time-consuming > > >>>>>>> iterations of > > >>>>>>>>> me > > >>>>>>>>>>>>>> or someone else trying to work out which metrics are > > >>>>>>> available, and > > >>>>>>>>>>>>>> which ones are desired for a given use case. One of > > >>> these > > >>>>> was > > >>>>>>>>> actually > > >>>>>>>>>>>>>> a long-running bug in Kafka Streams' metrics, so not > > >>> only > > >>>>> has > > >>>>>>> this > > >>>>>>>>>>>>>> confusion been a time sink, but it has also led to > > >>> bugs. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> I'm hoping this change won't be too controversial. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>> -John > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>> > > >>> > > >> > > > > > >