Hi Flink community,


We’re evaluating Flink CEP for a time-series monitoring use case and would
appreciate guidance.



*Use case (high level):*

We monitor many time-series keys and need to raise an alert when a
condition persists for a configured duration. Example: if point X stays
above threshold Y continuously for T minutes, mark it as an alert.



*The requirement: *

Reliably promote a key from an initial HOLD state to RAISED after T minutes
even if that key produces no further events.



*What we tried (POC, high level):*

POC environment: Managed Flink (e.g., 1.18+) consuming a streaming source;
CEP patterns implemented in Java.

Approach: used CEP with .within(Time.minutes(T)) to express “condition
persists for T minutes.”

Observation: timeouts did not fire for keys that became silent after the
condition first appeared. Our investigation suggests .within() relies on
event-time watermarks, and when no events arrive for a keyed partition the
watermark does not advance, so the timeout waits indefinitely.

Workarounds attempted:

Adding custom timers alongside CEP — partially worked but introduced
substantial complexity and state/timer cleanup concerns.

Sending heartbeat/synthetic events to advance watermarks — increased event
noise and complicated correctness, not acceptable for us.



*Questions:*

Is using Flink CEP for “condition true continuously for T minutes” a
recommended approach? If yes, what patterns do you recommend handling the
inactivity case reliably?

Is there a recommended way to make CEP timeouts fire when partitions are
idle (e.g., watermark idleness settings, punctuators, or processing-time
alternatives)?

If CEP is not ideal, what alternative pattern do you recommend
(KeyedProcessFunction with processing-time timers, hybrid CEP+timers,
external scheduler + state, etc.) considering scalability as well as the
need to monitor millions of points?



Best regards,

Rohit

Reply via email to