GGraziadei opened a new issue, #8583: URL: https://github.com/apache/storm/issues/8583
Currently, Apache Storm provides comprehensive metrics for throughput and average latency (execute-latency, process-latency). However, in high-precision real-time systems, averages often mask critical performance instabilities. This proposal introduces a native Jitter Metric calculated at two levels:

- Component level (Step Jitter): measures the variability of execution time within individual Bolts and Spouts.
- Topology level (Global Jitter): measures the variability of end-to-end completion latency for fully acked tuples.

## Why analysing jitter matters for real-time

In deterministic real-time processing, the variability of the latency is as important as the latency itself (https://ieeexplore.ieee.org/abstract/document/10877871). Predictable latency is a prerequisite for building a deterministic system.

- Micro-burst detection: high jitter reveals short spikes that average latency smooths out.
- Compliance: modern SLAs rely on percentiles (e.g., P99). Jitter is a strong leading indicator of tail-latency degradation.
- Root cause analysis: high component jitter typically points to GC pressure or resource contention, whereas high global jitter with stable components suggests network congestion or shuffle bottlenecks.
- Bottleneck identification: per-component jitter shows where in the topology a bottleneck occurs and helps distinguish its underlying cause, making performance issues easier to diagnose and resolve.
### Proposed model: Exponentially Weighted Moving Average (EWMA)

To ensure negligible performance impact, I propose using an Exponentially Weighted Moving Average (EWMA), following the RFC 1889 logic: https://www.rfc-editor.org/rfc/rfc1889#appendix-A.8

Mathematical model:

J_new = J_old + (|D_current - D_previous| - J_old) / 16

Here D_current and D_previous are two consecutive transit-time samples and J is the running jitter estimate.

```
GIVEN a State {ewmaJitter, lastTransit}
CONSTANT RFC1889_ALPHA = 1 / 16

PROCEDURE addValue(transitMs)
    IF transitMs < 0 THEN
        EXIT PROCEDURE

    IF lastTransit IS NOT UNINITIALIZED THEN
        // Calculate the absolute difference between the current and previous transit time
        deviation = ABS(transitMs - lastTransit)

        // Update the Exponentially Weighted Moving Average using the RFC 1889 smoothing factor
        ewmaJitter = ewmaJitter + (deviation - ewmaJitter) * RFC1889_ALPHA
    END IF

    // Store current transit time for the next iteration
    lastTransit = transitMs
END PROCEDURE
```

### Performance impact

- Minimal computational overhead: by using an EWMA, we avoid storing large datasets or sliding-window buffers. The jitter is updated with a single linear equation that requires only basic arithmetic.
- Memory efficiency: the EWMA algorithm is extremely memory-light. It requires a single persistent variable (8 bytes) per executor to maintain the moving-average state, plus the previous latency sample.
- System calls: to avoid redundant overhead, the metric hooks into the existing latency-tracking logic rather than taking additional timestamps. How this interacts with metrics that are already sampled still needs discussion.

### Limitations and constraints

- Clock skew: Global jitter may be affected when node clocks are unsynchronised. However, since jitter is computed from the difference between consecutive samples, a constant skew cancels out mathematically.
- Sampling bias: low sampling rates may miss high-frequency jitter spikes.
- Warm-up: as an EWMA-based metric, values may fluctuate initially before stabilizing.
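As an illustration of the `addValue` procedure above, here is a minimal Java sketch. The class name `EwmaJitter`, the `getJitter` accessor, and the `hasLast` flag are hypothetical and are not part of the Apache Storm API. The sketch assumes transit times arrive as `double` milliseconds and that updates come from a single thread.

```java
/**
 * Sketch of the proposal's addValue procedure.
 * Names are hypothetical; this is not an Apache Storm class.
 * Not thread-safe: assumes all updates come from one thread.
 */
final class EwmaJitter {
    // Smoothing factor from the proposal's formula (division by 16).
    private static final double RFC1889_ALPHA = 1.0 / 16.0;

    private double ewmaJitter = 0.0;   // running jitter estimate, in ms
    private double lastTransit = 0.0;  // previous valid transit time, in ms
    private boolean hasLast = false;   // false until the first valid sample arrives

    /** Feeds one transit-time sample (ms). Negative samples are ignored. */
    void addValue(double transitMs) {
        if (transitMs < 0) {
            return;
        }
        if (hasLast) {
            // Absolute difference between the current and previous transit time.
            double deviation = Math.abs(transitMs - lastTransit);
            // Move the estimate 1/16 of the way toward the new deviation.
            ewmaJitter += (deviation - ewmaJitter) * RFC1889_ALPHA;
        }
        // Remember this sample for the next call.
        lastTransit = transitMs;
        hasLast = true;
    }

    double getJitter() {
        return ewmaJitter;
    }

    public static void main(String[] args) {
        // Two illustrative traces with the same mean latency (10 ms).
        double[] steadyMs = {10, 10, 10, 10, 10, 10};
        double[] burstyMs = {4, 16, 4, 16, 4, 16};
        EwmaJitter steady = new EwmaJitter();
        EwmaJitter bursty = new EwmaJitter();
        for (int i = 0; i < steadyMs.length; i++) {
            steady.addValue(steadyMs[i]);
            bursty.addValue(burstyMs[i]);
        }
        System.out.printf("steady=%.3f ms, bursty=%.3f ms%n",
                steady.getJitter(), bursty.getJitter());
    }
}
```

The two traces in `main` have identical averages, yet only the alternating one produces a non-zero estimate, which is the effect the proposal wants to surface. Adding the same constant offset to every sample leaves the estimate unchanged, because only differences between consecutive samples enter the update.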
