[ 
https://issues.apache.org/jira/browse/FLINK-32957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833184#comment-17833184
 ] 

Piotr Nowojski commented on FLINK-32957:
----------------------------------------

{{mailboxLatencyMs}} shows basically the same thing, AFAIK. It is a sampled 
measurement of how long items wait in the mailbox queue before being 
executed, and timers are fired via the mailbox.
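
To illustrate the idea, here is a minimal sketch, not Flink's actual mailbox 
code. The class and method names are hypothetical and the clock is injected. 
A probe is enqueued together with its enqueue time, and the sampled latency is 
how long the probe waited behind earlier items, such as a slow timer callback.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical single-threaded mailbox, for illustration only.
final class MailboxLatencySketch {
    private final Queue<Runnable> mailbox = new ArrayDeque<>();
    private final LongSupplier clockMs;
    private long lastLatencyMs = -1; // -1 means "no sample taken yet"

    MailboxLatencySketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    // Regular work (records, timer callbacks, ...) goes through the same queue.
    void submit(Runnable mail) {
        mailbox.add(mail);
    }

    // Enqueue a probe that records how long it waited before being executed.
    void sampleLatency() {
        final long enqueuedAt = clockMs.getAsLong();
        mailbox.add(() -> lastLatencyMs = clockMs.getAsLong() - enqueuedAt);
    }

    // Execute everything currently in the mailbox, in FIFO order.
    void drain() {
        Runnable mail;
        while ((mail = mailbox.poll()) != null) {
            mail.run();
        }
    }

    long getLastLatencyMs() {
        return lastLatencyMs;
    }
}
```

Since timers share the queue with the probe, a backlog of timer callbacks 
would show up as a larger sampled latency in a setup like this.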

> Add current timer trigger lag to metrics
> ----------------------------------------
>
>                 Key: FLINK-32957
>                 URL: https://issues.apache.org/jira/browse/FLINK-32957
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / State Backends
>            Reporter: Rui Xia
>            Priority: Minor
>
> Timer trigger lag denotes the gap between the actual trigger timestamp and 
> the expected trigger timestamp (the timestamp registered with `TimeService`). 
> This metric can help users find out whether there is a backlog of timers. 
> A backlog of timers may affect downstream data processing. Users customize 
> the trigger logic, which may interact with downstream data processing. For 
> example, trigger logic can inject records into downstream operators. A 
> backlog of timers blocks that record injection. 
> In addition, a backlog of timers makes jobs unstable. Timers are used 
> by window operators, which rely on a timer to remove the window state of a 
> triggered window. A backlog of timers delays that cleanup, and the state 
> size may grow unexpectedly large. A large state size degrades the 
> performance of the state backend. In a cloud-native environment, a k8s pod 
> is prone to reaching its local disk limit due to large state files (RocksDB 
> SST).
> Currently, it is hard for users to observe the backlog of timers. As far as 
> I know, a heap dump is the only way to inspect the backlog of timers. Thus, 
> users cannot notice the backlog in time. FLINK-32954 
> (https://issues.apache.org/jira/browse/FLINK-32954) exposes the number of 
> heap timers, but is not suitable for RocksDB timers due to the performance 
> loss.
> Compared with FLINK-32954, timer trigger lag is much more lightweight for 
> RocksDB timers. 
>  * Reason 1: Timer trigger lag does not affect timer registration. 
>  * Reason 2: The effect on timer triggering is limited. Timer registration 
> is a hot code path, while timer triggering is much colder. In general, the 
> trigger interval is tens of seconds, so the timer trigger code path is 
> invoked only once every few tens of seconds. Thus, adding the timer trigger 
> lag calculation has little performance overhead.
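
The lag calculation described in the quoted issue can be sketched as follows. 
This is only an illustration under assumptions, not the proposed patch. The 
class, its methods, and the injected clock are hypothetical, and a plain 
in-memory priority queue stands in for the real timer storage. The point is 
that registration stays untouched and the lag is one subtraction on the 
(colder) trigger path.

```java
import java.util.PriorityQueue;
import java.util.function.LongSupplier;

// Hypothetical timer queue, for illustration only.
final class TimerTriggerLagSketch {
    private final PriorityQueue<Long> timers = new PriorityQueue<>();
    private final LongSupplier clockMs;
    private long lastTriggerLagMs = 0;

    TimerTriggerLagSketch(LongSupplier clockMs) {
        this.clockMs = clockMs;
    }

    // Hot path: registration does no extra work for the metric.
    void registerTimer(long timestampMs) {
        timers.add(timestampMs);
    }

    // Colder path: fire all due timers and record the lag of the latest one.
    // Returns the number of timers fired.
    int fireDueTimers() {
        int fired = 0;
        while (true) {
            Long head = timers.peek();
            long now = clockMs.getAsLong();
            if (head == null || head > now) {
                break;
            }
            timers.poll();
            // lag = actual trigger time - registered (expected) trigger time
            lastTriggerLagMs = now - head;
            fired++;
        }
        return fired;
    }

    // Value a metrics gauge could report.
    long getLastTriggerLagMs() {
        return lastTriggerLagMs;
    }
}
```

A growing value from `getLastTriggerLagMs()` would indicate that timers are 
firing later and later than their registered timestamps, i.e. a backlog.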



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
