rzo1 commented on issue #8583:
URL: https://github.com/apache/storm/issues/8583#issuecomment-4363421932

    Thanks for the detailed write-up - the motivation is solid and the RFC 1889 
EWMA is genuinely cheap, so the cost
     side of the proposal is hard to argue with. A few things I'd like to see 
clarified before this moves to a PR:       
                                                                                
                                  
     1. Relationship to existing Dropwizard metrics. Storm's metrics v2 is 
built on Dropwizard, and the existing         
     execute-latency / process-latency / complete-latency timers are already 
Histogram/Timer instances with reservoir    
     sampling. That gives us P99/P999, stddev, and mean for free, and they flow 
through the existing reporters
     (Prometheus, Graphite, etc.). Could you spell out what EWMA-of-abs-diff 
gives operators that the histogram          
     percentiles don't? The RFC 1889 form is a great fit for an RTP receiver 
computing inter-arrival jitter on the fly,
     but operators consuming Storm metrics typically want tail percentiles, and 
the two are not the same statistic. If
     the answer is "cheaper than a reservoir," it's worth quantifying - 
Dropwizard's exponentially decaying reservoir is
     also quite light.
   
     2. Definition. "Jitter" gets read three different ways in practice: (a) 
RFC 1889 smoothed inter-arrival deviation,  
     (b) latency standard deviation, (c) tail spread (P99 − P50). The proposal 
picks (a), which is fine, but the
     doc/metric name should be explicit so users don't assume (b) or (c), 
right?                                   
                     
     3. Smoothing factor. α = 1/16 is tuned for RTP's packet cadence. Storm 
tuple rates span many orders of magnitude: 
     at 1M tuples/sec α=1/16 averages over microseconds of wall clock, at 
10/sec over seconds. This should be
     configurable per topology (or per component), not a constant.              
                                         
                     
     4. Integration point. The line "hooks into existing latency tracking 
logic" is doing a lot of work. Could you sketch
      where exactly BoltExecutor / SpoutExecutor latency sampling paths, the 
stats BoltExecutorStats /
     SpoutExecutorStats, or a parallel listener? That's the part that 
determines whether this is a 50-line change or a   
     much larger one, and it'd be good to align on it before any code lands.
   
     5. Acked-only e2e jitter. Restricting global jitter to fully-acked tuples 
introduces a selection bias: failed and  
     timed-out tuples are often exactly the ones with pathological latency. 
Worth either including them or documenting
     the bias clearly.                                                          
                                         
                     
     6. Clock skew. "Constant skew cancels mathematically" is true, but e2e 
measurements cross workers/nodes where the   
     relevant problem is drift, not constant offset. NTP-synced clocks still 
drift on the order of ms, which can dominate
      the jitter signal at low latencies. Worth calling out as a real 
limitation rather than a dismissed one (at least in the documentation of such a 
feature)
                     
   Happy to help review a PR once the integration point and the 
EWMA-vs-histogram-percentile question are settled. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to