[
https://issues.apache.org/jira/browse/FLINK-32957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17833184#comment-17833184
]
Piotr Nowojski commented on FLINK-32957:
----------------------------------------
{{mailboxLatencyMs}} shows basically the same thing AFAIK. It is a sampled
measurement of how long things wait in the mailbox queue before being executed,
and timers are fired via the mailbox.
> Add current timer trigger lag to metrics
> ----------------------------------------
>
> Key: FLINK-32957
> URL: https://issues.apache.org/jira/browse/FLINK-32957
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Metrics, Runtime / State Backends
> Reporter: Rui Xia
> Priority: Minor
>
> Timer trigger lag denotes the gap between the actual trigger timestamp and
> the expected trigger timestamp (the timestamp registered with `TimeService`).
> This metric can help users find out whether there is a backlog of timers.
> A backlog of timers may affect downstream data processing. Users can customize
> the trigger logic, which may interact with downstream data processing. For
> example, trigger logic can inject records into downstream operators. A
> backlog of timers delays that record injection.
> On the other hand, a backlog of timers can make jobs unstable. Timers are used
> by window operators, which rely on a timer to remove the window state of a
> triggered window. A backlog of timers delays this state removal, and the state
> size may grow unexpectedly large. A large state size degrades the
> performance of the state backend. In a cloud-native environment, a k8s pod is
> prone to reaching its local disk limit due to large state files (RocksDB SST).
> Currently, it is hard for users to observe a backlog of timers. As far as I
> know, a heap dump is the only way to learn about it. Thus, users
> cannot notice a backlog of timers in time. FLINK-32954
> (https://issues.apache.org/jira/browse/FLINK-32954) exposes the number of heap
> timers, but is not suitable for RocksDB timers due to the performance loss.
> Compared with FLINK-32954, timer trigger lag is much more lightweight for
> RocksDB timers.
> * Reason 1: Timer trigger lag does not affect timer registration.
> * Reason 2: The effect on timer triggering is limited. Timer registration is
> a hot code path, while timer triggering is much colder. In general, the
> trigger interval is tens of seconds, so the timer trigger code path is
> invoked every few tens of seconds. Thus, adding the timer trigger lag
> calculation has little performance overhead.
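
To make the definition in the quoted description concrete, here is a minimal
Java sketch of how a trigger lag value could be computed each time a timer
fires. It is only an illustration under stated assumptions: TimerLagTracker,
onTimerFired and getCurrentLagMs are hypothetical names, not Flink APIs, and
clamping the lag at zero is an assumption rather than something the issue
specifies.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not a Flink class): remembers the lag of the most recently fired timer. */
final class TimerLagTracker {
    // Latest observed lag in milliseconds; atomic so another thread (e.g. a reporter) can read it.
    private final AtomicLong currentLagMs = new AtomicLong(0L);

    /**
     * Record one timer firing.
     *
     * @param registeredTimestampMs the timestamp the timer was registered for (expected trigger time)
     * @param actualTriggerTimestampMs the clock reading when the timer actually fired
     */
    void onTimerFired(long registeredTimestampMs, long actualTriggerTimestampMs) {
        // lag = actual trigger timestamp - expected trigger timestamp.
        // Assumption: clamp at zero so clock adjustments never yield a negative lag.
        long lag = Math.max(0L, actualTriggerTimestampMs - registeredTimestampMs);
        currentLagMs.set(lag);
    }

    /** Value that a gauge-style metric could expose. */
    long getCurrentLagMs() {
        return currentLagMs.get();
    }
}
```

The only added work is one subtraction and one atomic write on the trigger
path, with nothing added to timer registration, which matches the two reasons
listed in the description.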