[
https://issues.apache.org/jira/browse/FLINK-23611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404598#comment-17404598
]
Matthias edited comment on FLINK-23611 at 8/26/21, 5:20 AM:
------------------------------------------------------------
[~trohrmann]'s suspicion sounds like the most reasonable one for this issue.
It's hard to investigate due to the missing Hadoop logs. I was able to
reproduce the {{returnValue}} assertion mentioned above.
The {{returnValue}} is actually different because the assertion is called
multiple times: once for the YARN Session Cluster deployment and once for the
job submission. This was a bit misleading when debugging the code because both
calls used the same code and the same Thread name. I cleaned up the code a
bit and added different Thread names as part of this refactoring. I hesitated
to refactor even more. There is no need to run the job submission in a
separate thread: we always wait for the Thread to return, and in the cases
where we didn't, we should. That was the timeout we ran into as part of the
test failure: the cluster didn't come up (at least that's our assumption for
now). Hence, the subsequent job submission failed with a {{returnValue}} of
{{1}}, which caused the {{AssertionError}} we observed.
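A minimal sketch of why distinct Thread names help here, assuming a runner thread that stores an integer return value. The class {{NamedRunnerSketch}}, the helper {{runAndCheck}}, and the thread names are hypothetical and are not taken from {{YARNSessionCapacitySchedulerITCase}}; the point is only that the failure message identifies which of the two calls produced the unexpected {{returnValue}}.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

class NamedRunnerSketch {

    /** Hypothetical stand-in for the test's runner thread: runs an action and keeps its return value. */
    static final class Runner extends Thread {
        private final IntSupplier action;
        private final AtomicInteger returnValue = new AtomicInteger(-1);

        Runner(String threadName, IntSupplier action) {
            super(threadName);
            this.action = action;
        }

        @Override
        public void run() {
            returnValue.set(action.getAsInt());
        }

        int getReturnValue() {
            return returnValue.get();
        }
    }

    /**
     * Runs the action in a named thread, waits for it to finish, and returns
     * null on a match or a message that names the thread on a mismatch.
     */
    static String runAndCheck(String threadName, IntSupplier action, int expected) {
        Runner runner = new Runner(threadName, action);
        runner.start();
        try {
            runner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return threadName + ": interrupted while waiting";
        }
        int actual = runner.getReturnValue();
        return actual == expected
                ? null
                : threadName + ": expected returnValue " + expected + " but was " + actual;
    }

    public static void main(String[] args) {
        // Simulated cluster deployment that succeeds and job submission that fails.
        System.out.println(runAndCheck("yarn-session-cluster", () -> 0, 0));
        System.out.println(runAndCheck("job-submission", () -> 1, 0));
    }
}
```

With a shared Thread name, both calls would report the same origin, which is the ambiguity described above.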
was (Author: mapohl):
[~trohrmann]'s suspicion sounds like the most reasonable one for this issue.
It's hard to investigate due to the missing Hadoop logs. I was able to
reproduce the {{returnValue}} assertion mentioned
[above|https://issues.apache.org/jira/browse/FLINK-23611?focusedCommentId=17404239&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17404239].
The {{returnValue}} is actually different because the assertion is called
multiple times: once for the YARN Session Cluster deployment and once for the
job submission. This was a bit misleading when debugging the code because both
calls used the same code and the same Thread name. I cleaned up the code a
bit and added different Thread names as part of this refactoring. I hesitated
to refactor even more. There is no need to run the job submission in a
separate thread: we always wait for the Thread to return, and in the cases
where we didn't, we should. That was the timeout we ran into as part of the
test failure: the cluster didn't come up (at least that's our assumption for
now). Hence, the subsequent job submission failed with a {{returnValue}} of
{{1}}, which caused the {{AssertionError}} we observed.
It's not clear to me yet why we're running into a timeout, though. We're
waiting for the job submission to return through [this join in
YARNSessionCapacitySchedulerITCase.submitJob|https://github.com/XComp/flink/blob/646ff2d36f40704f5dca017b8fffed78bd51b307/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L388].
But that should return as well since the underlying Thread fails and does not
end up in a loop.
The YARN Cluster Thread is explicitly stopped in the [finally block of the test
method|https://github.com/XComp/flink/blob/646ff2d36f40704f5dca017b8fffed78bd51b307/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L356].
Hence, it shouldn't block either.
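One way to make such a hang easier to diagnose is to bound the join and dump the stack of the thread that is still alive. The following is a sketch only: {{BoundedJoinSketch}} and {{joinWithin}} are hypothetical names, the timeouts are arbitrary, and nothing here reflects how the linked test code is actually written.

```java
import java.util.concurrent.CountDownLatch;

class BoundedJoinSketch {

    /**
     * Waits up to timeoutMillis (must be > 0, since join(0) waits forever)
     * for the thread to terminate. Returns true if it terminated in time;
     * otherwise prints where the thread is currently stuck and returns false.
     */
    static boolean joinWithin(Thread thread, long timeoutMillis) {
        try {
            thread.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (thread.isAlive()) {
            System.err.println("Thread '" + thread.getName() + "' still running after "
                    + timeoutMillis + " ms:");
            for (StackTraceElement frame : thread.getStackTrace()) {
                System.err.println("\tat " + frame);
            }
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A thread that returns immediately, like a submission that fails fast.
        Thread finished = new Thread(() -> { }, "job-submission");
        finished.start();
        System.out.println(joinWithin(finished, 5_000));

        // A thread that blocks until released, simulating a hang.
        CountDownLatch release = new CountDownLatch(1);
        Thread stuck = new Thread(() -> {
            try {
                release.await();
            } catch (InterruptedException ignored) {
                // Exit quietly when interrupted.
            }
        }, "yarn-session-cluster");
        stuck.setDaemon(true);
        stuck.start();
        System.out.println(joinWithin(stuck, 200));
        release.countDown();
    }
}
```

The stack dump shows which call the blocked thread is waiting in, which is the information that was missing when the test hung.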
I enabled Hadoop logging for the YARN tests; the logs are printed to stdout.
This should help us investigate similar issues in the future.
> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
> hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-23611
> URL: https://issues.apache.org/jira/browse/FLINK-23611
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.14.0, 1.12.5
> Reporter: Xintong Song
> Assignee: Matthias
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 1.14.0, 1.12.6
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21439&view=logs&j=245e1f2e-ba5b-5570-d689-25ae21e5302f&t=e7f339b2-a7c3-57d9-00af-3712d4b15354&l=28959
--
This message was sent by Atlassian Jira
(v8.3.4#803005)