[jira] [Comment Edited] (FLINK-33414) MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot fails due to unexpected TimeoutException

Jiang Xin (Jira) Thu, 21 Dec 2023 01:00:09 -0800


    [ https://issues.apache.org/jira/browse/FLINK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799308#comment-17799308 ]


Jiang Xin edited comment on FLINK-33414 at 12/21/23 8:59 AM:
-------------------------------------------------------------

[~mapohl] Could you assign this issue to me?

I found that the cause is that 
`FineGrainedSlotManager.checkResourceRequirementsWithDelay` schedules a 
resource requirements check with a delay of 50 ms by default. That check 
also detects the lack of slots and notifies the JobMaster, so that the 
scheduler doesn't need to wait for the timeout to fail the job. If the resource 
requirements check does not finish within 50 ms, the JobMaster fails with a 
TimeoutException instead. So we can fix the issue by disabling the resource 
requirements check, or by adjusting the check delay and the wait timeout so 
that the TimeoutException is never thrown.
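
To illustrate the race described above, here is a minimal, self-contained sketch. It is not Flink code: the class `SlotRaceSketch`, the method `firstFailure`, and `NoResourceException` (a stand-in for `NoResourceAvailableException`) are hypothetical names, and the two scheduled tasks only model the delayed requirements check and the slot request timeout. Whichever task fires first decides which exception the caller sees.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SlotRaceSketch {

    /** Hypothetical stand-in for Flink's NoResourceAvailableException. */
    static class NoResourceException extends Exception {
        NoResourceException(String message) {
            super(message);
        }
    }

    /**
     * Races a delayed "requirements check" against a "slot request timeout".
     * Returns the simple class name of whichever failure completes the
     * pending request first.
     */
    static String firstFailure(long checkDelayMs, long slotRequestTimeoutMs) {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);
        CompletableFuture<Void> slotRequest = new CompletableFuture<>();
        try {
            // Models the delayed check that notices the missing slots.
            executor.schedule(() -> {
                slotRequest.completeExceptionally(
                        new NoResourceException("not enough slots"));
            }, checkDelayMs, TimeUnit.MILLISECONDS);

            // Models the timeout that fails the pending slot request.
            executor.schedule(() -> {
                slotRequest.completeExceptionally(new TimeoutException(
                        "Timeout has occurred: " + slotRequestTimeoutMs + " ms"));
            }, slotRequestTimeoutMs, TimeUnit.MILLISECONDS);

            slotRequest.get(); // always fails; only the first failure is kept
            return "none";
        } catch (ExecutionException e) {
            return e.getCause().getClass().getSimpleName();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Check fires well before the timeout: the expected failure wins.
        System.out.println(firstFailure(50, 2000));
        // Check is late relative to the timeout: the TimeoutException wins.
        System.out.println(firstFailure(500, 100));
    }
}
```

In this model the test only sees the expected exception when the check delay plus any scheduling latency stays clearly below the timeout, which is why widening the gap between the two values (or removing one side of the race) should make the outcome deterministic.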


was (Author: jiang xin):
I also ran into the exception and, after delving into the code, I found the 
cause is that `FineGrainedSlotManager.checkResourceRequirementsWithDelay` 
schedules a resource requirements check with a delay of 50 ms by default. 
That check also detects the lack of slots and notifies the JobMaster, so 
that the scheduler doesn't need to wait for the timeout to fail the job. If the 
resource requirements check does not finish within 50 ms, the JobMaster 
fails with a TimeoutException instead. So we can fix the issue by disabling the 
resource requirements check, or by adjusting the check delay and the timeout so 
that the TimeoutException is never thrown.

> MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot fails due to 
> unexpected TimeoutException
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33414
>                 URL: https://issues.apache.org/jira/browse/FLINK-33414
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>              Labels: github-actions, test-stability
>
> We see this test instability in [this 
> build|https://github.com/XComp/flink/actions/runs/6695266358/job/18192039035#step:12:9253].
> {code:java}
> Error: 17:04:52 17:04:52.042 [ERROR] Failures: 
> Error: 17:04:52 17:04:52.042 [ERROR]   MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot:120 
> Oct 30 17:04:52 Expecting a throwable with root cause being an instance of:
> Oct 30 17:04:52   org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> Oct 30 17:04:52 but was an instance of:
> Oct 30 17:04:52   java.util.concurrent.TimeoutException: Timeout has occurred: 100 ms
> Oct 30 17:04:52   at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86)
> Oct 30 17:04:52   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> Oct 30 17:04:52   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Oct 30 17:04:52   ...(27 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
> {code}
> The same error occurred in the [finegrained_resourcemanager stage of this 
> build|https://github.com/XComp/flink/actions/runs/6468655160/job/17563927249#step:11:26516]
>  (as reported in FLINK-33245).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
