haiqingchen created FLINK-39620:
-----------------------------------
Summary: Cumulative window re-emits old windows after recovery due
to stale InternalTimerService.currentWatermark in GlobalAggCombiner
Key: FLINK-39620
URL: https://issues.apache.org/jira/browse/FLINK-39620
Project: Flink
Issue Type: Bug
Components: Table SQL / Runtime
Affects Versions: 1.20.4, 2.1.1
Environment: Flink 1.18.x (reproduced locally);
logic is the same in 1.20.x and master.
SQL pattern: CUMULATE with local-global two-phase aggregation.
Event-time with shiftTimeZone = Asia/Shanghai.
Reporter: haiqingchen
After a task restart / checkpoint recovery, a CUMULATE window with local-global
aggregation may re-emit already-emitted historical windows (a.k.a. "backfill").
Root cause is a dual watermark inconsistency during the recovery warm-up
window: SlicingWindowOperator.currentWatermark is restored from state, but
InternalTimerService.currentWatermark is still Long.MIN_VALUE until the first
post-recovery watermark arrives. During this warm-up gap, GlobalAggCombiner
uses the stale timer-service watermark to decide whether to register a window
timer, incorrectly treats historical windows as "not fired", re-registers old
timers, and once the real watermark advances those timers fire immediately and
cause a chain of fireWindow -> nextTriggerWindow for cumulative windows.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)