haiqingchen created FLINK-39620:
-----------------------------------

             Summary: Cumulative window re-emits old windows after recovery due 
to stale InternalTimerService.currentWatermark in GlobalAggCombiner
                 Key: FLINK-39620
                 URL: https://issues.apache.org/jira/browse/FLINK-39620
             Project: Flink
          Issue Type: Bug
          Components: Table SQL / Runtime
    Affects Versions: 1.20.4, 2.1.1
         Environment: Flink 1.18.x (reproduced locally); 
logic is the same in 1.20.x and master.
SQL pattern: CUMULATE with local-global two-phase aggregation.
Event-time with shiftTimeZone = Asia/Shanghai.
            Reporter: haiqingchen


After a task restart / checkpoint recovery, a CUMULATE window with local-global 
aggregation may re-emit already-emitted historical windows (a.k.a. "backfill"). 
Root cause is a dual watermark inconsistency during the recovery warm-up 
window: SlicingWindowOperator.currentWatermark is restored from state, but 
InternalTimerService.currentWatermark is still Long.MIN_VALUE until the first 
post-recovery watermark arrives. During this warm-up gap, GlobalAggCombiner 
uses the stale timer-service watermark to decide whether to register a window 
timer, incorrectly treats historical windows as "not fired", re-registers old 
timers, and once the real watermark advances those timers fire immediately and 
cause a chain of fireWindow -> nextTriggerWindow for cumulative windows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to