[ 
https://issues.apache.org/jira/browse/FLINK-39481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088869#comment-18088869
 ] 

Efrat Levitan edited comment on FLINK-39481 at 6/14/26 9:16 AM:
----------------------------------------------------------------

Another testCumulateWindow_Cubep[3] 
[failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75899&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96&l=10432]
 
 
{{Jun 11 19:58:14 19:58:14.097 [ERROR] Failures: }}
{{Jun 11 19:58:14 19:58:14.097 [ERROR] 
org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testCumulateWindow_Cube
 }}
{{Jun 11 19:58:14 19:58:14.097 [INFO] Run 1: PASS }}
{{Jun 11 19:58:14 19:58:14.098 [INFO] Run 2: PASS }}
{{Jun 11 19:58:14 19:58:14.098 [ERROR] Run 3: 
WindowDistinctAggregateITCase.testCumulateWindow_Cube:688 }}
{{Jun 11 19:58:14 expected: }}
{{Jun 11 19:58:14 "0,a,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 }}
{{Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:10,6,19.98,5.0,1.0,3 }}
{{Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:15,6,19.98,5.0,1.0,3 }}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:10,2,6.66,6.0,3.0,2 }}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:15,2,6.66,6.0,3.0,2 }}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:30,2020-10-10T00:00:35,1,3.33,3.0,3.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:30,2020-10-10T00:00:40,1,3.33,3.0,3.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:30,2020-10-10T00:00:45,1,3.33,3.0,3.0,1 
}}
{{Jun 11 19:58:14 
0,null,2020-10-10T00:00:30,2020-10-10T00:00:35,1,7.77,7.0,7.0,0 }}
{{Jun 11 19:58:14 
0,null,2020-10-10T00:00:30,2020-10-10T00:00:40,1,7.77,7.0,7.0,0 }}
{{Jun 11 19:58:14 
0,null,2020-10-10T00:00:30,2020-10-10T00:00:45,1,7.77,7.0,7.0,0 }}
{{Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 
}}
{{Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:10,8,26.64,6.0,1.0,4 
}}
{{Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:15,8,26.64,6.0,1.0,4 
}}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:30,2020-10-10T00:00:35,2,11.10,7.0,3.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:30,2020-10-10T00:00:40,2,11.10,7.0,3.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:30,2020-10-10T00:00:45,2,11.10,7.0,3.0,1" }}
{{Jun 11 19:58:14 but was: }}
{{Jun 11 19:58:14 "0,a,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 }}
{{Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:10,6,19.98,5.0,1.0,3 }}
{{Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:15,6,19.98,5.0,1.0,3 }}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:10,2,6.66,6.0,3.0,2 }}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:15,2,6.66,6.0,3.0,2 }}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 
}}
{{Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1 
}}
{{Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 
}}
{{Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:10,8,26.64,6.0,1.0,4 
}}
{{Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:15,8,26.64,6.0,1.0,4 
}}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 }}
{{Jun 11 19:58:14 
1,null,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1"}}
{{Jun 11 19:58:14 19:58:14.098 [INFO] Run 4: PASS }}
{{ }}


was (Author: JIRAUSER302736):
Another testCumulateWindow_Cubep[3] 
[failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=75899&view=logs&j=0c940707-2659-5648-cbe6-a1ad63045f0a&t=019d96ac-af5c-5c6d-106d-b67f63cc6e96&l=10432]
 
 
Jun 11 19:58:14 19:58:14.097 [ERROR] Failures: 
Jun 11 19:58:14 19:58:14.097 [ERROR] 
org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testCumulateWindow_Cube
 
Jun 11 19:58:14 19:58:14.097 [INFO] Run 1: PASS 
Jun 11 19:58:14 19:58:14.098 [INFO] Run 2: PASS 
Jun 11 19:58:14 19:58:14.098 [ERROR] Run 3: 
WindowDistinctAggregateITCase.testCumulateWindow_Cube:688 
Jun 11 19:58:14 expected: 
Jun 11 19:58:14 "0,a,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 
Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:10,6,19.98,5.0,1.0,3 
Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:15,6,19.98,5.0,1.0,3 
Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:10,2,6.66,6.0,3.0,2 
Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:15,2,6.66,6.0,3.0,2 
Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:30,2020-10-10T00:00:35,1,3.33,3.0,3.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:30,2020-10-10T00:00:40,1,3.33,3.0,3.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:30,2020-10-10T00:00:45,1,3.33,3.0,3.0,1 
Jun 11 19:58:14 0,null,2020-10-10T00:00:30,2020-10-10T00:00:35,1,7.77,7.0,7.0,0 
Jun 11 19:58:14 0,null,2020-10-10T00:00:30,2020-10-10T00:00:40,1,7.77,7.0,7.0,0 
Jun 11 19:58:14 0,null,2020-10-10T00:00:30,2020-10-10T00:00:45,1,7.77,7.0,7.0,0 
Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 
Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:10,8,26.64,6.0,1.0,4 
Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:15,8,26.64,6.0,1.0,4 
Jun 11 19:58:14 1,null,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 1,null,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 1,null,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 
1,null,2020-10-10T00:00:30,2020-10-10T00:00:35,2,11.10,7.0,3.0,1 
Jun 11 19:58:14 
1,null,2020-10-10T00:00:30,2020-10-10T00:00:40,2,11.10,7.0,3.0,1 
Jun 11 19:58:14 
1,null,2020-10-10T00:00:30,2020-10-10T00:00:45,2,11.10,7.0,3.0,1" 
Jun 11 19:58:14 but was: 
Jun 11 19:58:14 "0,a,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 
Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:10,6,19.98,5.0,1.0,3 
Jun 11 19:58:14 0,a,2020-10-10T00:00,2020-10-10T00:00:15,6,19.98,5.0,1.0,3 
Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:10,2,6.66,6.0,3.0,2 
Jun 11 19:58:14 0,b,2020-10-10T00:00,2020-10-10T00:00:15,2,6.66,6.0,3.0,2 
Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 0,b,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:05,4,11.10,5.0,1.0,2 
Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:10,8,26.64,6.0,1.0,4 
Jun 11 19:58:14 1,null,2020-10-10T00:00,2020-10-10T00:00:15,8,26.64,6.0,1.0,4 
Jun 11 19:58:14 1,null,2020-10-10T00:00:15,2020-10-10T00:00:20,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 1,null,2020-10-10T00:00:15,2020-10-10T00:00:25,1,4.44,4.0,4.0,1 
Jun 11 19:58:14 1,null,2020-10-10T00:00:15,2020-10-10T00:00:30,1,4.44,4.0,4.0,1"
Jun 11 19:58:14 19:58:14.098 [INFO] Run 4: PASS 
 

> ⁠[tests] Fix flaky 
> WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets⁠ 
> ----------------------------------------------------------------------------------
>
>                 Key: FLINK-39481
>                 URL: https://issues.apache.org/jira/browse/FLINK-39481
>             Project: Flink
>          Issue Type: Bug
>          Components: Table SQL / Planner
>            Reporter: featzhang
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>
> The test `WindowDistinctAggregateITCase#testCumulateWindow_GroupingSets` is 
> flaky 
> and fails intermittently in CI due to a race condition in the test framework 
> itself.
> 1. Failure Evidence
> Azure Build #74194 (2026-04-17):
>  - Run 1: PASS
>  - Run 2: PASS
>  - Run 3: FAIL ← flaky
>  - Run 4: PASS
> Error message:
> org.opentest4j.AssertionFailedError:
> Expected 23 rows but got 14 rows.
> Missing rows (all related to window [2020-10-10T00:00:30, ...]):
>  - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:35,1,3.33,3.0,3.0,1
>  - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:40,1,3.33,3.0,3.0,1
>  - 0,b,2020-10-10T00:00:30,2020-10-10T00:00:45,1,3.33,3.0,3.0,1
>  - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:35,1,7.77,7.0,7.0,0
>  - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:40,1,7.77,7.0,7.0,0
>  - 0,null,2020-10-10T00:00:30,2020-10-10T00:00:45,1,7.77,7.0,7.0,0
>  - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:35,2,11.10,7.0,3.0,1
>  - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:40,2,11.10,7.0,3.0,1
>  - 1,null,2020-10-10T00:00:30,2020-10-10T00:00:45,2,11.10,7.0,3.0,1
> 2. Root Cause Analysis
> The test uses `FailingCollectionSource` with `'failing-source' = 'true'`, 
> which intentionally throws an artificial exception to trigger 
> checkpoint-restore path.
> The source emits 11 rows with an event at timestamp `2020-10-10 00:00:32` and 
> `2020-10-10 00:00:34` at the end of the dataset.
> The flakiness stems from a race condition in `FailingCollectionSource`:
> 1. The source uses a checkpoint-triggered failure mechanism. When the failure 
> occurs, only the rows emitted BEFORE the checkpoint are guaranteed to be 
> replayed after restore.
> 2. The last 2 rows (timestamps 00:00:32 and 00:00:34) advance the watermark 
> past 
> `00:00:31`, triggering the cumulate windows ending at 00:00:35/40/45. 
> If the failure and restore happen AFTER the source emits rows 
> at 00:00:32/00:00:34 but BEFORE the downstream operators flush their state, 
> those late-window results can be lost.
> 3. Specifically, the issue is that `FailingCollectionSource` uses a legacy 
> `SourceFunction` API with `ctx.getCheckpointLock()`. The timing of when 
> `numSuccessfulCheckpoints >= 1 && lastCheckpointedEmittedNum >= 1` 
> is satisfied relative to when the last 2 rows are emitted is 
> non-deterministic,
> meaning sometimes the last 2 rows are emitted before the artificial failure 
> is 
> triggered (correct case), and sometimes after (data loss case).
>  
> 3. Impact
>  - Flink CI blocked intermittently on the `test_ci table` job
>  - 3 out of 4 retry runs pass, demonstrating it's a classic flaky test pattern
>  - Affects: WindowDistinctAggregateITCase (CUMULATE window with GROUPING 
> SETS/CUBE/ROLLUP)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to