[
https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544790#comment-17544790
]
Niklas Semmler commented on FLINK-24960:
----------------------------------------
I finally found an explanation for aspect 2: why the YARN session thread
sometimes continues to run even when an exception is thrown.
There is a race condition between the [interactive CLI of the YARN
session|https://github.com/apache/flink/blob/6086e327cd4168e09eac4f6b0b86fb29ebe3860c/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L874]
and the [job
submission|https://github.com/apache/flink/blob/98a6a5432b642aa647f6edcd60dae49ef9093786/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YarnTestBase.java#L890].
To explain the problem:
# The _main thread_ executing the test starts two threads: the _jobmanager
thread_ executing the job manager as part of a YARN session and the _submission
thread_ submitting the Flink job.
# The _jobmanager thread_ is created before and ends after the _submission
thread_.
# Communication between _main thread_ and the other threads happens via
rerouted stdin, stdout, stderr channels. The rerouting takes place when the
_jobmanager thread_ and _submission thread_ are created respectively.
# The _main thread_ waits for the _jobmanager thread_ and _submission thread_
to print a specific output message to the rerouted stdout before continuing.
# The _jobmanager thread_ must be shut down explicitly by sending the string
"stop" through the rerouted stdin.
The problem appears if the _submission thread_ reroutes stdin before the
_jobmanager thread_ opens a BufferedReader on the old stdin. In this case, the
stop message from the _main thread_ to the _jobmanager thread_ is lost and the
{*}_jobmanager thread_ continues running indefinitely{*}. In this case, even an
exception will not fail the test.
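To make the ordering problem concrete, here is a minimal single-threaded sketch. It is not code from Flink: the class {{StdinRerouteRace}} and its methods are hypothetical, and the two threads are replaced by a flag that decides whether the reader is created before or after the second reroute. It only assumes that rerouting is done with {{System.setIn}} and that the reader binds to whatever {{System.in}} is at creation time.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the race, not code from Flink.
class StdinRerouteRace {

    /** Returns what the "jobmanager" reader receives: "stop", or null if the message is lost. */
    static String simulate(boolean readerOpenedBeforeSecondReroute) {
        InputStream originalIn = System.in;
        try {
            // Stream the main thread uses to talk to the jobmanager thread.
            // The "stop" message is already queued on it.
            InputStream jobManagerIn =
                    new ByteArrayInputStream("stop\n".getBytes(StandardCharsets.UTF_8));
            // Stream installed when the submission thread is created; it carries no "stop".
            InputStream submissionIn = new ByteArrayInputStream(new byte[0]);

            System.setIn(jobManagerIn); // reroute for the jobmanager thread
            BufferedReader reader = null;
            if (readerOpenedBeforeSecondReroute) {
                reader = newStdinReader(); // binds to jobManagerIn
            }
            System.setIn(submissionIn); // reroute for the submission thread
            if (reader == null) {
                reader = newStdinReader(); // too late: binds to submissionIn
            }
            // In the real test the reader would block forever; here the empty
            // stream reports end-of-input, so readLine() returns null instead.
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            System.setIn(originalIn); // leave the JVM as we found it
        }
    }

    private static BufferedReader newStdinReader() {
        return new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println("reader first:  " + simulate(true));
        System.out.println("reroute first: " + simulate(false));
    }
}
```

With the reader created first, {{simulate(true)}} returns "stop". With the reroute first, {{simulate(false)}} returns null, which corresponds to the lost stop message.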
We can improve on this by changing the output that the _main thread_ matches on.
Instead of matching on the
[current message|https://github.com/apache/flink/blob/b98c66cfe44d1b4002fe56dbf323d8ea8ce0409c/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNSessionCapacitySchedulerITCase.java#L309],
we could print a new message after the creation of the
[BufferedReader|https://github.com/apache/flink/blob/6086e327cd4168e09eac4f6b0b86fb29ebe3860c/flink-yarn/src/main/java/org/apache/flink/yarn/cli/FlinkYarnSessionCli.java#L874].
The question is whether adding an output line there will have unintended
consequences. The code is used directly by the Flink YARN session CLI, so if
anybody is already parsing the output, this may have adverse effects. Also, if
we change it there, we may want to touch other tests that make use of the same
code.
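A rough sketch of the idea of signalling after the reader exists, again not taken from Flink: the marker text, the class {{ReaderReadySignal}} and the captured output stream are all assumptions made for illustration. The jobmanager thread prints the marker only after it has created its BufferedReader, and the main thread polls the captured output for that marker before it allows the second reroute.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.PrintStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of the proposed ordering, not code from Flink.
class ReaderReadySignal {
    // Assumed marker text; the real message would have to be chosen in the CLI.
    static final String READY_MARKER = "stdin reader ready";

    /** Returns the line received by the "jobmanager" thread. */
    static String run() {
        InputStream originalIn = System.in;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        PrintStream jobManagerOut = new PrintStream(captured, true);
        AtomicReference<String> received = new AtomicReference<>();
        try {
            PipedOutputStream toJobManager = new PipedOutputStream();
            System.setIn(new PipedInputStream(toJobManager)); // reroute for the jobmanager

            Thread jobManager = new Thread(() -> {
                try {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(System.in, StandardCharsets.UTF_8));
                    jobManagerOut.println(READY_MARKER); // printed only after the reader exists
                    received.set(in.readLine());
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
            jobManager.setDaemon(true);
            jobManager.start();

            // Main thread: wait for the marker before anything else touches stdin.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
            while (!captured.toString().contains(READY_MARKER)) {
                if (System.nanoTime() > deadline) {
                    throw new IllegalStateException("jobmanager thread never became ready");
                }
                Thread.sleep(10);
            }

            // Stand-in for the submission thread rerouting stdin; now harmless.
            System.setIn(new ByteArrayInputStream(new byte[0]));

            toJobManager.write("stop\n".getBytes(StandardCharsets.UTF_8));
            toJobManager.flush();
            toJobManager.close();
            jobManager.join(10_000);
            return received.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            System.setIn(originalIn);
        }
    }

    public static void main(String[] args) {
        System.out.println("jobmanager received: " + run());
    }
}
```

Because the main thread waits for the marker, the reader is always bound to the jobmanager's stdin before the second reroute, so the stop message arrives regardless of thread scheduling.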
Alternative approaches that only touch the tests are:
* move the current println statement further down in the code (see
[PR19852|https://github.com/apache/flink/pull/19852])
* add a (one second?) delay before the job submission
However, both of these approaches only reduce the chance that the test
instability will appear; neither removes the race.
> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
> hangs on azure
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-24960
> URL: https://issues.apache.org/jira/browse/FLINK-24960
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.15.0, 1.14.3
> Reporter: Yun Gao
> Assignee: Niklas Semmler
> Priority: Critical
> Labels: pull-request-available, test-stability
> Fix For: 1.16.0
>
>
> {code:java}
> Nov 18 22:37:08
> ================================================================================
> Nov 18 22:37:08 Test
> testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase)
> is running.
> Nov 18 22:37:08
> --------------------------------------------------------------------------------
> Nov 18 22:37:25 22:37:25,470 [ main] INFO
> org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted
> hostname:port: 5718b812c7ab:38622
> Nov 18 22:52:36
> ==============================================================================
> Nov 18 22:52:36 Process produced no output for 900 seconds.
> Nov 18 22:52:36
> ==============================================================================
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722&view=logs&j=f450c1a5-64b1-5955-e215-49cb1ad5ec88&t=cc452273-9efa-565d-9db8-ef62a38a0c10&l=36395