[jira] [Comment Edited] (FLINK-23611) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots hangs on azure

Matthias (Jira) Wed, 25 Aug 2021 22:21:10 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-23611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404598#comment-17404598
 ]


Matthias edited comment on FLINK-23611 at 8/26/21, 5:20 AM:
------------------------------------------------------------

[~trohrmann]'s suspicion sounds like the most reasonable one for this issue. 
It's hard to investigate due to the missing Hadoop logs. I was able to 
reproduce the {{returnValue}} assertion mentioned above.

The {{returnValue}} actually differs between calls because the assertion is 
executed multiple times: once for the YARN session cluster deployment and once 
for the job submission. This was a bit misleading when debugging the code 
because both calls share the same code and used the same thread name. I cleaned 
up the code a bit and gave the threads different names as part of this 
refactoring. I hesitated to refactor even more. There is no need to run the job 
submission in a separate thread. We always wait for the thread to return, and 
in the cases where we didn't, we should. That was the timeout we ran into as 
part of the test failure: the cluster didn't come up (at least that's our 
assumption for now). Hence, the subsequent job submission failed with a 
{{returnValue}} of {{1}}, which caused the {{AssertionError}} we observed.
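
A minimal sketch of the pattern described above, not the actual code in 
{{YARNSessionCapacitySchedulerITCase}}: each step runs in its own distinctly 
named thread, the caller joins with a timeout, and the recorded return value is 
checked afterwards. The class name, method names, thread names and timeout 
values are all hypothetical and chosen for illustration only.

```java
import java.util.function.IntSupplier;

// Hypothetical illustration; none of these names exist in the Flink code base.
class NamedThreadJoinSketch {

    /** Runs a task in its own named thread and records the task's return value. */
    static final class Runner extends Thread {
        private final IntSupplier task;
        private volatile int returnValue = -1; // -1 means "not finished yet"

        Runner(String name, IntSupplier task) {
            super(name); // a distinct name makes thread dumps and logs readable
            this.task = task;
        }

        @Override
        public void run() {
            returnValue = task.getAsInt();
        }

        int getReturnValue() {
            return returnValue;
        }
    }

    /** Starts the task, waits at most timeoutMillis, and returns its return value. */
    static int runAndWait(String threadName, IntSupplier task, long timeoutMillis) {
        Runner runner = new Runner(threadName, task);
        runner.start();
        try {
            // Note: join(0) would wait forever, so pass a positive timeout.
            runner.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + threadName, e);
        }
        if (runner.isAlive()) {
            // Fail loudly and name the thread instead of hanging the test.
            runner.interrupt();
            throw new IllegalStateException(
                    "Thread '" + threadName + "' did not finish within " + timeoutMillis + " ms");
        }
        return runner.getReturnValue();
    }

    public static void main(String[] args) {
        // Two calls through the same code path, distinguishable by thread name.
        int deployment = runAndWait("yarn-session-deployment", () -> 0, 5_000L);
        int submission = runAndWait("job-submission", () -> 1, 5_000L);
        System.out.println("deployment=" + deployment + ", submission=" + submission);
    }
}
```

With distinct thread names, a failing assertion or a timeout can be attributed 
to either the deployment or the submission step directly from the message.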


was (Author: mapohl):
[~trohrmann]'s suspicion sounds like the most reasonable one for this issue. 
It's hard to investigate due to the missing Hadoop logs. I was able to 
reproduce the {{returnValue}} assertion mentioned 
[above|https://issues.apache.org/jira/browse/FLINK-23611?focusedCommentId=17404239&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17404239].

The {{returnValue}} actually differs between calls because the assertion is 
executed multiple times: once for the YARN session cluster deployment and once 
for the job submission. This was a bit misleading when debugging the code 
because both calls share the same code and used the same thread name. I cleaned 
up the code a bit and gave the threads different names as part of this 
refactoring. I hesitated to refactor even more. There is no need to run the job 
submission in a separate thread. We always wait for the thread to return, and 
in the cases where we didn't, we should. That was the timeout we ran into as 
part of the test failure: the cluster didn't come up (at least that's our 
assumption for now). Hence, the subsequent job submission failed with a 
{{returnValue}} of {{1}}, which caused the {{AssertionError}} we observed.

It's not clear to me yet why we're running into a timeout, though. We're 
waiting for the job submission to return through [this join in 
YARNSessionCapacitySchedulerITCase.submitJob|https://github.com/XComp/flink/blob/646ff2d36f40704f5dca017b8fffed78bd51b307/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L388].
 But that join should return as well, since the underlying thread fails and 
does not end up in a loop.

The YARN Cluster Thread is explicitly stopped in the [finally block of the test 
method|https://github.com/XComp/flink/blob/646ff2d36f40704f5dca017b8fffed78bd51b307/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L356].
 Hence, it shouldn't block either.

I enabled Hadoop logging for the YARN tests; the logs are printed to stdout. 
This should help us investigate similar issues in the future.

> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
>  hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-23611
>                 URL: https://issues.apache.org/jira/browse/FLINK-23611
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.14.0, 1.12.5
>            Reporter: Xintong Song
>            Assignee: Matthias
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 1.14.0, 1.12.6
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21439&view=logs&j=245e1f2e-ba5b-5570-d689-25ae21e5302f&t=e7f339b2-a7c3-57d9-00af-3712d4b15354&l=28959



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
