Martijn Visser created FLINK-40011:
--------------------------------------
Summary:
CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider is
flaky: test source can finish before NoMoreSplitsEvent is delivered
Key: FLINK-40011
URL: https://issues.apache.org/jira/browse/FLINK-40011
Project: Flink
Issue Type: Bug
Components: Table SQL / Planner, Tests
Affects Versions: 2.4.0
Reporter: Martijn Visser
Assignee: Martijn Visser
testStreamRecordTimestampInserterSinkRuntimeProvider fails intermittently when
the
job fails instead of finishing.
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=76431&view=results
(leg: test_ci table)
{code}
org.apache.flink.table.api.TableException: Failed to wait job finish
at
CommonExecSinkITCase.testStreamRecordTimestampInserterSinkRuntimeProvider(CommonExecSinkITCase.java:127)
Caused by: JobExecutionException: Job execution failed.
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an
OperatorCoordinator to a task was lost. Triggering task failover to ensure
consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: T1[154] ->
WatermarkAssigner[155] -> StreamRecordTimestampInserter[156] -> T1[156]: Writer
(1/4) - execution #0
Caused by:
org.apache.flink.runtime.operators.coordination.TaskNotRunningException: Task
is not running, but in state FINISHED
{code}
Root cause: the in-test TestSource reader returned END_OF_INPUT as soon as it
had
emitted its rows. The SingleSplitEnumerator assigns the split and then delivers
a
NoMoreSplitsEvent as a separate event via the coordinator thread. If the
split-holding
subtask (here, (1/4)) finishes before that event is delivered, the coordinator
sends
it to an already-FINISHED task; the framework treats the undeliverable event as
a lost
OperatorEvent and triggers a failover, which the job's
NoRestartBackoffTimeStrategy
turns into a hard failure. No data is lost (all rows are emitted) — this is a
benign
test-config race, not a coordinator bug.
Fix: defer END_OF_INPUT until notifyNoMoreSplits() has been received, so the
reader
cannot finish before the event lands (mirroring
SourceReaderBase.finishedOrAvailableLater()).
First observed in build 76431; the fix is justified on the structural race, not
frequency.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)