[
https://issues.apache.org/jira/browse/SPARK-12516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075693#comment-15075693
]
Saisai Shao commented on SPARK-12516:
-------------------------------------
Hi [~vanzin], what is your suggestion for this issue? I have failed to figure out a
proper solution to address it.
> Properly handle NM failure situation for Spark on Yarn
> ------------------------------------------------------
>
> Key: SPARK-12516
> URL: https://issues.apache.org/jira/browse/SPARK-12516
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.0
> Reporter: Saisai Shao
>
> Failure of a NodeManager will make all the executors belonging to that NM exit
> silently.
> Currently, in the implementation of YarnSchedulerBackend, the driver receives
> an onDisconnected event when an executor is lost, and then asks the AM for
> the loss reason. The AM holds this query connection until the RM reports back the
> status of the lost container, and then replies to the driver. In the case of NM
> failure, the RM cannot detect the failure until a timeout (10 minutes by
> default), so the driver's query for the loss reason times out (120 seconds).
> After the timeout, the executor state on the driver side is cleaned up,
> but on the AM side this state is still maintained until the NM heartbeat
> times out. This can introduce some unexpected behaviors:
> ---
> * When dynamic allocation is disabled, the executor count on the driver
> side is less than the count on the AM side after the timeout (from 120 seconds
> to 10 minutes), and cannot be ramped up to the expected number until the RM
> detects the failure of the NM and marks the related containers as completed.
> {quote}
> For example, the target executor number is 10, with 5 NMs (each NM running 2
> executors). When 1 NM fails, the 2 related executors are lost. After the
> driver-side query timeout, the executor count on the driver side is 8, but on the AM
> side it is still 10, so the AM will not request additional containers until its
> own count drops to 8 (after 10 minutes).
> {quote}
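The count divergence in the example above can be sketched as follows (a minimal illustration of the arithmetic only, not actual Spark code; all names are hypothetical):

```python
# Hypothetical illustration of the driver/AM executor-count divergence
# after an NM failure; not Spark code.
target = 10   # desired executor count
lost = 2      # executors on the failed NM

# Driver cleans its state after the 120s loss-reason query timeout:
driver_count = target - lost
# AM keeps stale state until the NM heartbeat timeout (10 min by default):
am_count = target

# The AM requests new containers only when its own count drops below the
# target, so no replacements are requested during the gap:
requested = max(0, target - am_count)
print(driver_count, am_count, requested)  # 8 10 0
```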
> ---
> * When dynamic allocation is enabled, the target executor number is
> maintained on both the driver and AM sides and synced between them. The target
> executor number will be correct on the driver side after the query timeout (120
> seconds), but it remains incorrect on the AM side until the NM failure is
> detected (10 minutes). In this case the actual executor number is less than
> the calculated one.
> {quote}
> For example, if the current target executor number on the driver side is N and
> on the AM side is M, then M - N is the number of lost executors.
> When the executor number is ramped up to A, the actual number will be
> A - (M - N).
> When the executor number is brought down to B, the actual number will
> be max(0, B - (M - N)). When the actual number of executors reaches 0, the
> whole system hangs, and will only recover if the driver requests more
> resources, or after the 10-minute timeout.
> This can be reproduced by running the SparkPi example in yarn-client mode
> with the following configurations:
> spark.dynamicAllocation.enabled true
> spark.shuffle.service.enabled true
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 3
> In the middle of the job, kill an NM that is running only executors (not the AM).
> {quote}
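The stale-count arithmetic described above can be written out as a small sketch (an illustration only, not Spark code; the function name is hypothetical):

```python
# Hypothetical illustration of the effect of a stale AM-side count on
# dynamic allocation; not Spark code.
def actual_executors(target, stale_lost):
    """Executors actually obtained when the AM's view counts
    stale_lost executors that are already gone."""
    return max(0, target - stale_lost)

M, N = 5, 3          # AM-side and driver-side counts; M - N executors were lost
stale = M - N

# Ramping up to A = 10 yields only A - (M - N) actual executors:
print(actual_executors(10, stale))  # 8

# Bringing down to B = 2 yields max(0, B - (M - N)); at 0 the job hangs:
print(actual_executors(2, stale))   # 0
```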
> ---
> Possible solutions:
> * Sync the actual executor number from the driver to the AM after the RPC
> timeout (120 seconds), and also clean up the related state in the AM.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)