Hello all, and thanks for the KIP, Sophie,

Just some comments on the discussion so far:

B2/G1:
In principle, it shouldn't matter whether we report "spans" or
"end-to-end" latency. But in practice, some of the spans are pretty
difficult to measure directly (like time spent waiting in the topics,
time from the event happening to the ETL producer choosing to send it,
time spent in send/receive buffers, etc.).

In other words, it's practically easier to compute spans by subtracting
e2e latencies than it is to compute e2e latencies by adding spans. Computing
a span from e2e measurements always involves subtracting just two numbers,
whereas computing an e2e latency from spans involves adding _all_ the spans
leading up to the end you care about.
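
To make that concrete, here is a minimal sketch (Java, with made-up numbers
and stage names -- nothing here is from the KIP itself):

    public class E2eVsSpans {
        public static void main(String[] args) {
            // Hypothetical e2e latencies (ms), i.e. wall-clock time minus the
            // record's event timestamp, observed at two points in a topology.
            final long e2eAtUpstreamTask = 120L;   // record enters task A
            final long e2eAtDownstreamTask = 450L; // record enters task C

            // A span is just the difference of two e2e measurements:
            final long span = e2eAtDownstreamTask - e2eAtUpstreamTask; // 330 ms

            // Reconstructing e2e from spans needs *every* span up to that point:
            final long[] spans = {100L, 20L, 200L, 130L}; // produce, buffers, topic wait, ...
            long e2eFromSpans = 0L;
            for (final long s : spans) {
                e2eFromSpans += s; // 450 ms, but only if no span was missed
            }

            System.out.println("span=" + span + "ms, e2e=" + e2eFromSpans + "ms");
        }
    }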

It seems like people really prefer to have spans when they are debugging
latency problems, whereas e2e latency is a more general measurement
that basically every person/application cares about and should be
monitoring.

Altogether, it really seems to provide more value to more people if we report
e2e latencies. Regarding "record-staleness" as a name, I have no
preference; I'd defer to other people's intuition.

G2:
I think the processor-node metric is nice, since the inside of a task can
introduce a significant amount of latency in some cases. Plus, it's a more
direct measurement if you really want to know (for the purposes of IQ
or something) how long it takes source events to "show up" at the store.

I think actually recording it at every processor could be expensive, but we
already record a bunch of metrics at the node level.

B1:
I think the 50th percentile could be reasonable to record also. Even if it's a
poor metric for operational purposes, a lot of people might expect to see some
kind of "average". Actually, I was surprised not to see "min". Is there a
reason to leave it off?

I might suggest:
min, median (50th), 75th, 99th, max
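
(Just to illustrate how a user might pull these out of a running app -- the
metric name below is an assumption based on the "end-to-end-latency" naming
in this thread, not something the KIP has pinned down:)

    import org.apache.kafka.streams.KafkaStreams;

    public class E2eLatencyDump {
        // Print every end-to-end-latency metric (min/median/p75/p99/max, or
        // whatever the KIP finally registers) exposed by a KafkaStreams app.
        public static void printE2eLatencies(final KafkaStreams streams) {
            streams.metrics().forEach((name, metric) -> {
                if (name.name().contains("end-to-end-latency")) {
                    System.out.printf("%s %s = %s%n",
                            name.tags(), name.name(), metric.metricValue());
                }
            });
        }
    }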

B3:
I agree we should include late records (though not the ones we drop).
It may be spiky, but only when there are legitimately some records with a
high end-to-end latency, which is the whole point of these metrics.

That's it! I don't think I have any other feedback, other than a request to
also report "min".

Thanks,
-John

On Wed, May 13, 2020, at 16:58, Guozhang Wang wrote:
> Thanks Sophie for the KIP, a few quick thoughts:
> 
> 1) The end-to-end latency includes both the processing latency of the task
> and the time spent sitting in intermediate topics. I have a similar
> feeling as Boyang mentioned above that the latency metric of a task A
> actually measures the latency of the sub-topology up to but not including
> the processing of A, which is a bit weird.
> 
> Maybe my feeling comes from the name "latency" itself, since today we
> already have several "latency" metrics which measure elapsed
> system time for processing a record, etc., while here we are comparing the
> system wall-clock time with the record timestamp.
> 
> Maybe we can consider renaming it as "record-staleness" (note we already
> have a "record-lateness" metric), in which case recording at the
> system-time before we start processing the record sounds more natural.
> 
> 2) With that in mind, I'm wondering if the processor-node-level DEBUG
> metric is worth adding, given that we already have a task-level processing
> latency metric. Basically, a specific node's e2e latency is similar to the
> task-level e2e latency + task-level processing latency. Personally I think
> having a task-level record-staleness metric is sufficient.
> 
> 
> 
> Guozhang
> 
> 
> 
> On Wed, May 13, 2020 at 11:46 AM Sophie Blee-Goldman <sop...@confluent.io>
> wrote:
> 
> > 1. I felt that 50% was not a particularly useful gauge for this specific
> > metric, as it's presumably most useful at putting an *upper* bound on the
> > latency you can reasonably expect to see. I chose percentiles that would
> > hopefully give a good sense of what *most* records will experience, and
> > what *close to all* records will.
> >
> > However I'm not married to these specific numbers and could be convinced.
> > Would be especially interested in hearing from users on this.
> >
> > 2. I'm inclined not to include the "hop-to-hop latency" in this KIP since
> > users can always compute it themselves by subtracting the previous node's
> > end-to-end latency. I guess we could do it either way since you can always
> > compute one from the other, but I think the end-to-end latency feels more
> > valuable, as its main motivation is not to debug bottlenecks in the
> > topology but to give users a sense of how long it takes a record to be
> > reflected in certain parts of the topology. For example this might be
> > useful for users who are wondering roughly when a record that was just
> > produced will be included in their IQ results. Debugging is just a nice
> > side effect -- but maybe I didn't make that clear enough in the KIP's
> > motivation.
> >
> > 3. Good question, I should address this in the KIP. The short answer is
> > "yes", we will include late records. I added a paragraph to the end of
> > the Proposed Changes section explaining the reasoning here; please let me
> > know if you have any concerns.
> >
> > 4. Assuming you're referring to the existing metric "process-latency", that
> > metric reflects the time for the literal Node#process method to run,
> > whereas this metric would always be measured relative to the event
> > timestamp.
> >
> > That said, the naming collision there is pretty confusing, so I've renamed
> > the metrics in this KIP to "end-to-end-latency", which I feel better
> > reflects the nature of the metric anyway.
> >
> > Thanks for the feedback!
> >
> > On Wed, May 13, 2020 at 10:21 AM Boyang Chen <reluctanthero...@gmail.com>
> > wrote:
> >
> > > Thanks for the KIP Sophie. Getting the E2E latency is important for
> > > understanding the bottleneck of the application.
> > >
> > > A couple of questions and ideas:
> > >
> > > 1. Could you clarify the rationale for picking the 75th, 99th, and max
> > > percentiles? Normally I see cases where we use the 50th and 90th
> > > percentiles as well in production systems.
> > >
> > > 2. The current latency being computed is cumulative, i.e. if a record goes
> > > through A -> B -> C, then P(C) = T(B->C) + P(B) = T(B->C) + T(A->B) + T(A)
> > > and so on, where P() represents the captured latency, and T() represents
> > > the time for transiting the records between two nodes, including
> > > processing time. For monitoring purposes, maybe having T(B->C) and T(A->B)
> > > is more natural, viewed as "hop-to-hop latency"; otherwise, if there is a
> > > spike in T(A->B), both P(B) and P(C) are affected at the same time. In the
> > > same spirit, the E2E latency is meaningful only when the record exits from
> > > the sink, as this marks the whole time this record spent inside the
> > > funnel. Do you think we could have separate treatment for sink nodes and
> > > other nodes, so that other nodes only count the time since receiving the
> > > record from the last hop? I'm not proposing a solution here, just want to
> > > discuss this alternative to see if it is reasonable.
> > >
> > > 3. As we are going to monitor late-arriving records as well, they would
> > > create some really spiky graphs when the out-of-order records are
> > > interleaved with on-time records. Should we also supply a smoothed version
> > > of the latency metrics, or should users just take care of it themselves?
> > >
> > > 4. Regarding this new metric, we haven't discussed its relation to our
> > > existing processing latency metrics. Could you add some context comparing
> > > them, and maybe a simple `when to use which` guideline?
> > >
> > > Boyang
> > >
> > > On Tue, May 12, 2020 at 7:28 PM Sophie Blee-Goldman <sop...@confluent.io>
> > > wrote:
> > >
> > > > Hey all,
> > > >
> > > > I'd like to kick off discussion on KIP-613 which aims to add end-to-end
> > > > latency metrics to Streams. Please take a look:
> > > >
> > > >
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-613%3A+Add+end-to-end+latency+metrics+to+Streams
> > > >
> > > > Cheers,
> > > > Sophie
> > > >
> > >
> >
> 
> 
> -- 
> -- Guozhang
>
