[ 
https://issues.apache.org/jira/browse/FLINK-33414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17799308#comment-17799308
 ] 

Jiang Xin edited comment on FLINK-33414 at 12/21/23 9:26 AM:
-------------------------------------------------------------

[~mapohl] Could you assign this issue to me?

I found that the cause is that 
`FineGrainedSlotManager.checkResourceRequirementsWithDelay` schedules a 
resource requirements check with a delay of 50 ms by default. That check also 
detects the lack of slots and notifies the JobMaster, so the scheduler doesn't 
need to wait for the slot request timeout (100 ms in the test) to fail the 
job. If the resource requirements check does not finish within 50 ms, the 
JobMaster fails with a TimeoutException instead. We can fix the issue either 
by disabling the resource requirements check or by adjusting the check delay 
and the timeout so that the TimeoutException is never thrown.
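
To illustrate the race, here is a minimal JDK-only sketch. It is not Flink 
code: `SlotCheckRace`, `NoResourceException`, and `firstFailure` are 
hypothetical names, and the two timers only stand in for the delayed 
requirements check and the slot request timeout. Whichever timer fires first 
decides which exception the caller sees.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SlotCheckRace {

    /** Hypothetical stand-in for Flink's NoResourceAvailableException. */
    static class NoResourceException extends RuntimeException {
        NoResourceException(String message) {
            super(message);
        }
    }

    /**
     * Races a delayed "requirements check" against a "slot request timeout"
     * and returns the simple class name of the failure that wins.
     */
    public static String firstFailure(long checkDelayMs, long timeoutMs) {
        ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);
        CompletableFuture<Void> slotRequest = new CompletableFuture<>();
        try {
            // Simulated requirements check: reports the lack of slots.
            timer.schedule(() -> {
                slotRequest.completeExceptionally(
                        new NoResourceException("not enough slots"));
            }, checkDelayMs, TimeUnit.MILLISECONDS);

            // Simulated slot request timeout: fails the pending request.
            timer.schedule(() -> {
                slotRequest.completeExceptionally(
                        new TimeoutException("Timeout has occurred: " + timeoutMs + " ms"));
            }, timeoutMs, TimeUnit.MILLISECONDS);

            // Only the first completeExceptionally call takes effect.
            return slotRequest
                    .handle((ignored, failure) -> failure.getClass().getSimpleName())
                    .join();
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Check fires well before the timeout: the expected failure wins.
        System.out.println(firstFailure(10, 500));
        // Check is delayed past the timeout: the TimeoutException wins.
        System.out.println(firstFailure(500, 10));
    }
}
```

With margins as narrow as 50 ms versus 100 ms, a slow CI machine can delay 
the first timer enough for the second one to win, which matches the failure 
seen in the test.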


> MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot fails due to 
> unexpected TimeoutException
> ---------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33414
>                 URL: https://issues.apache.org/jira/browse/FLINK-33414
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>              Labels: github-actions, test-stability
>
> We see this test instability in [this 
> build|https://github.com/XComp/flink/actions/runs/6695266358/job/18192039035#step:12:9253].
> {code:java}
> Error: 17:04:52 17:04:52.042 [ERROR] Failures: 
> Error: 17:04:52 17:04:52.042 [ERROR]   MiniClusterITCase.testHandleStreamingJobsWhenNotEnoughSlot:120 
> Oct 30 17:04:52 Expecting a throwable with root cause being an instance of:
> Oct 30 17:04:52   org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException
> Oct 30 17:04:52 but was an instance of:
> Oct 30 17:04:52   java.util.concurrent.TimeoutException: Timeout has occurred: 100 ms
> Oct 30 17:04:52   at org.apache.flink.runtime.jobmaster.slotpool.PhysicalSlotRequestBulkCheckerImpl.lambda$schedulePendingRequestBulkWithTimestampCheck$0(PhysicalSlotRequestBulkCheckerImpl.java:86)
> Oct 30 17:04:52   at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> Oct 30 17:04:52   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Oct 30 17:04:52   ...(27 remaining lines not displayed - this can be changed with Assertions.setMaxStackTraceElementsDisplayed)
> {code}
> The same error occurred in the [finegrained_resourcemanager stage of this 
> build|https://github.com/XComp/flink/actions/runs/6468655160/job/17563927249#step:11:26516]
>  (as reported in FLINK-33245).


