[
https://issues.apache.org/jira/browse/FLINK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799308#comment-17799308
]
Jiang Xin edited comment on FLINK-33414 at 12/21/23 8:59 AM:
-------------------------------------------------------------
[~mapohl] Could you assign this issue to me?
I found that the cause is that
`FineGrainedSlotManager.checkResourceRequirementsWithDelay` schedules a
resource requirements check with a delay of 50 ms by default. This check
also detects the lack of slots and notifies the JobMaster, so that the
scheduler doesn't need to wait for the timeout to fail the job. If the resource
requirements check has not finished within 50 ms, the JobMaster fails with a
TimeoutException instead. So we can fix the issue by disabling the resource
requirements check, or by adjusting the check delay and the waiting timeout so
that the TimeoutException is never thrown.
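To make the race concrete, here is a toy model of the timing, not Flink code.
It assumes the 100 ms timeout from the test failure quoted below and a
hypothetical `checkDuration` for how long the check needs to run and notify the
JobMaster. The class and method names are made up for illustration, and
treating a tie as a timeout is also an assumption.

```java
import java.time.Duration;

/** Toy timing model of the race described above; none of this is Flink API. */
public class SlotCheckRace {

    /** Which failure the test would observe. */
    enum Outcome { NO_RESOURCE_AVAILABLE, TIMEOUT }

    /**
     * @param checkDelay     delay before the requirements check starts
     * @param checkDuration  hypothetical time the check needs to run and notify the JobMaster
     * @param requestTimeout timeout after which the JobMaster gives up on the slot request
     */
    static Outcome outcome(Duration checkDelay, Duration checkDuration, Duration requestTimeout) {
        // Point in time at which the JobMaster learns about the missing slots.
        Duration notifiedAt = checkDelay.plus(checkDuration);
        // Assumption: if both happen at the same instant, the timeout wins.
        return notifiedAt.compareTo(requestTimeout) < 0
                ? Outcome.NO_RESOURCE_AVAILABLE
                : Outcome.TIMEOUT;
    }

    public static void main(String[] args) {
        Duration delay = Duration.ofMillis(50);
        Duration timeout = Duration.ofMillis(100);
        // Check finishes 10 ms after it starts: notification arrives at 60 ms.
        System.out.println(outcome(delay, Duration.ofMillis(10), timeout));
        // Check is slow (60 ms): notification would arrive at 110 ms, after the timeout.
        System.out.println(outcome(delay, Duration.ofMillis(60), timeout));
    }
}
```

In this model the expected NoResourceAvailableException is only observed while
the check delay plus the check duration stays below the timeout, which is why
adjusting the delay and the timeout relative to each other should remove the
flakiness.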
was (Author: jiang xin):
I also hit this exception and, after delving into the code, found that the
cause is that `FineGrainedSlotManager.checkResourceRequirementsWithDelay`
schedules a resource requirements check with a delay of 50 ms by default.
This check also detects the lack of slots and notifies the JobMaster, so
that the scheduler doesn't need to wait for the timeout to fail the job. If the
resource requirements check has not finished within 50 ms, the JobMaster
fails with a TimeoutException instead. So we can fix the issue by disabling the
resource requirements check, or by adjusting the check delay and timeout so that
the TimeoutException is never thrown.
> MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot fails due to
> unexpected TimeoutException
> ---------------------------------------------------------------------------------------------------
>
> Key: FLINK-33414
> URL: https://issues.apache.org/jira/browse/FLINK-33414
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.19.0
> Reporter: Matthias Pohl
> Priority: Critical
> Labels: github-actions, test-stability
>
> We see this test instability in [this
> build|https://github.com/XComp/flink/actions/runs/6695266358/job/18192039035#step:12:9253].
> {code:java}
> Error: 17:04:52 17:04:52.042 [ERROR] Failures:
> Error: 17:04:52 17:04:52.042 [ERROR] MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot:120
> Oct 30 17:04:52 Expecting a throwable with root cause being an instance of:
> Oct 30 17:04:52 org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> Oct 30 17:04:52 but was an instance of:
> Oct 30 17:04:52 java.util.concurrent.TimeoutException: Timeout has occurred: 100 ms
> Oct 30 17:04:52 at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86)
> Oct 30 17:04:52 at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> Oct 30 17:04:52 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Oct 30 17:04:52 ...(27 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
> {code}
> The same error occurred in the [finegrained_resourcemanager stage of this
> build|https://github.com/XComp/flink/actions/runs/6468655160/job/17563927249#step:11:26516]
> (as reported in FLINK-33245).