rzo1 commented on issue #8583:
URL: https://github.com/apache/storm/issues/8583#issuecomment-4363421932
Thanks for the detailed write-up - the motivation is solid and the RFC 1889
EWMA is genuinely cheap, so the cost
side of the proposal is hard to argue with. A few things I'd like to see
clarified before this moves to a PR:
1. Relationship to existing Dropwizard metrics. Storm's metrics v2 is
built on Dropwizard, and the existing
execute-latency / process-latency / complete-latency timers are already
Histogram/Timer instances with reservoir
sampling. That gives us P99/P999, stddev, and mean for free, and they flow
through the existing reporters
(Prometheus, Graphite, etc.). Could you spell out what EWMA-of-abs-diff
gives operators that the histogram
percentiles don't? The RFC 1889 form is a great fit for an RTP receiver
computing inter-arrival jitter on the fly,
but operators consuming Storm metrics typically want tail percentiles, and
the two are not the same statistic. If
the answer is "cheaper than a reservoir," it's worth quantifying -
Dropwizard's exponentially decaying reservoir is
also quite light.
2. Definition. "Jitter" gets read three different ways in practice: (a)
RFC 1889 smoothed inter-arrival deviation,
(b) latency standard deviation, (c) tail spread (P99 − P50). The proposal
picks (a), which is fine, but the
doc/metric name should be explicit so users don't assume (b) or (c),
right?
3. Smoothing factor. α = 1/16 is tuned for RTP's packet cadence. Storm
tuple rates span many orders of magnitude:
at 1M tuples/sec α=1/16 averages over microseconds of wall clock, at
10/sec over seconds. This should be
configurable per topology (or per component), not a constant.
4. Integration point. The line "hooks into existing latency tracking
logic" is doing a lot of work. Could you sketch
where exactly BoltExecutor / SpoutExecutor latency sampling paths, the
stats BoltExecutorStats /
SpoutExecutorStats, or a parallel listener? That's the part that
determines whether this is a 50-line change or a
much larger one, and it'd be good to align on it before any code lands.
5. Acked-only e2e jitter. Restricting global jitter to fully-acked tuples
introduces a selection bias: failed and
timed-out tuples are often exactly the ones with pathological latency.
Worth either including them or documenting
the bias clearly.
6. Clock skew. "Constant skew cancels mathematically" is true, but e2e
measurements cross workers/nodes where the
relevant problem is drift, not constant offset. NTP-synced clocks still
drift on the order of ms, which can dominate
the jitter signal at low latencies. Worth calling out as a real
limitation rather than a dismissed one (at least in the documentation of such a
feature)
Happy to help review a PR once the integration point and the
EWMA-vs-histogram-percentile question are settled.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]