[
https://issues.apache.org/jira/browse/FLINK-38872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18053760#comment-18053760
]
rstest commented on FLINK-38872:
--------------------------------
Submitted a simple fix in Github PR for this issue!
> waitForAllTaskRunning Hang Indefinitely
> ---------------------------------------
>
> Key: FLINK-38872
> URL: https://issues.apache.org/jira/browse/FLINK-38872
> Project: Flink
> Issue Type: Bug
> Components: Test Infrastructure, Tests
> Affects Versions: 2.2.0
> Reporter: rstest
> Priority: Major
> Labels: pull-request-available
> Attachments: timeout.patch
>
>
> `CommonTestUtils.waitForAllTaskRunning()` does not have a timeout mechanism
> and may hang indefinitely if the job being waited on is lost, cancelled, or
> enters an unexpected terminal state.
>
> {code:java}
> public static void waitForAllTaskRunning(
> MiniCluster miniCluster, JobID jobId, boolean allowFinished) throws
> Exception {
> waitForAllTaskRunning(() -> getGraph(miniCluster, jobId), allowFinished);
> } {code}
> This method calls `waitUntilCondition()` which polls indefinitely with no
> upper bound on wait time:
> {code:java}
> public static void waitUntilCondition(
> SupplierWithException<Boolean, Exception> condition) throws Exception
> {
> waitUntilCondition(condition, Duration.ofMillis(1));
> } {code}
> If the job identified by `jobId` is:
> - Lost due to cluster failure or restart
> - Cancelled unexpectedly
> - Failed and entered a terminal state
> Never properly scheduled The method will continue polling forever, causing
> the test to hang indefinitely rather than failing with a clear error message.
>
> h2. Scenario Example
> {code:java}
> JobClient jobClient = env.executeAsync();
> // If something causes the job to be lost here (e.g., cluster issue)
> // This will hang forever because the job no longer exists
> CommonTestUtils.waitForAllTaskRunning(
> miniCluster, jobClient.getJobID(), false); {code}
>
> Add a timeout parameter to `waitForAllTaskRunning()` to ensure tests fail
> fast with a clear error message rather than hanging indefinitely.
> I attached a proposed patch for this issue and happy to send a PR for the
> issue if you think this is reasonable.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)