Hi Flink community,

We’re evaluating Flink CEP for a time-series monitoring use case and would appreciate guidance.

*Use case (high level):* We monitor many time-series keys and need to raise an alert when a condition persists for a configured duration. Example: if point X stays above threshold Y continuously for T minutes, mark it as an alert.

*The requirement:* Reliably promote a key from an initial HOLD state to RAISED after T minutes, even if that key produces no further events.

*What we tried (POC, high level):*
- POC environment: Managed Flink (e.g., 1.18+) consuming a streaming source; CEP patterns implemented in Java.
- Approach: used CEP with .within(Time.minutes(T)) to express “condition persists for T minutes.”
- Observation: timeouts did not fire for keys that became silent after the condition first appeared. Our investigation suggests that .within() relies on event-time watermarks; when no events arrive for a keyed partition, the watermark does not advance, so the timeout waits indefinitely.

Workarounds attempted:
- Adding custom timers alongside CEP: this partially worked, but introduced substantial complexity and state/timer cleanup concerns.
- Sending heartbeat/synthetic events to advance watermarks: this increased event noise and complicated correctness, so it is not acceptable for us.

*Questions:*
1. Is using Flink CEP for “condition true continuously for T minutes” a recommended approach? If yes, what patterns do you recommend for handling the inactivity case reliably?
2. Is there a recommended way to make CEP timeouts fire when partitions are idle (e.g., watermark idleness settings, punctuators, or processing-time alternatives)?
3. If CEP is not ideal, what alternative pattern do you recommend (KeyedProcessFunction with processing-time timers, hybrid CEP+timers, external scheduler + state, etc.), considering scalability and the need to monitor millions of points?

Best regards,
Rohit
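
P.S. To make the observation above concrete, here is a toy model of how we currently understand the behaviour. It is plain Java with no Flink dependencies, it is not our POC code, and every name in it (EventTimeTimeoutSketch, onHold, advanceWatermark) is hypothetical. It only illustrates the assumption that an event-time timeout fires when the watermark reaches its deadline, so a silent stream never promotes HOLD to RAISED.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of event-time timeouts. Not Flink code; all names are hypothetical. */
public class EventTimeTimeoutSketch {
    private long watermark = Long.MIN_VALUE;
    // Deadline (event-time millis) -> keys in HOLD waiting on that deadline.
    private final TreeMap<Long, List<String>> timers = new TreeMap<>();
    private final List<String> raised = new ArrayList<>();

    /** A key enters HOLD at eventTime; its deadline is eventTime + timeoutMillis. */
    public void onHold(String key, long eventTime, long timeoutMillis) {
        timers.computeIfAbsent(eventTime + timeoutMillis, t -> new ArrayList<>()).add(key);
    }

    /** Deadlines are only checked here, i.e. only when the watermark moves forward. */
    public void advanceWatermark(long newWatermark) {
        if (newWatermark <= watermark) {
            return; // watermarks never move backwards
        }
        watermark = newWatermark;
        // All deadlines at or before the new watermark are due.
        Map<Long, List<String>> due = timers.headMap(watermark, true);
        for (List<String> keys : due.values()) {
            raised.addAll(keys);
        }
        due.clear(); // removes the fired deadlines from the backing map
    }

    public List<String> raised() {
        return raised;
    }

    public static void main(String[] args) {
        EventTimeTimeoutSketch op = new EventTimeTimeoutSketch();
        long timeout = 5 * 60_000L; // T = 5 minutes, in millis

        op.onHold("sensor-1", 0L, timeout);
        // The stream goes silent: no watermark arrives, so nothing is raised.
        System.out.println(op.raised()); // prints []

        // Only once something advances the watermark does the timeout fire.
        op.advanceWatermark(timeout);
        System.out.println(op.raised()); // prints [sensor-1]
    }
}
```

If this mental model is wrong for CEP's .within() handling, a correction would already help us a lot.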
