[ https://issues.apache.org/jira/browse/SPARK-12516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075693#comment-15075693 ]

Saisai Shao commented on SPARK-12516:
-------------------------------------

Hi [~vanzin], what is your suggestion for this issue? I have failed to figure out 
a proper solution to address it.

> Properly handle NM failure situation for Spark on Yarn
> ------------------------------------------------------
>
>                 Key: SPARK-12516
>                 URL: https://issues.apache.org/jira/browse/SPARK-12516
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>
> Failure of a NodeManager will make all the executors belonging to that NM exit 
> silently.
> Currently, in the implementation of YarnSchedulerBackend, the driver receives an 
> onDisconnected event when an executor is lost and then asks the AM for the loss 
> reason; the AM holds this query until the RM reports back the status of the lost 
> container, and then replies to the driver. In the case of an NM failure, the RM 
> cannot detect the failure until the NM liveness timeout expires (10 minutes by 
> default), so the driver's query for the loss reason times out first (120 
> seconds). After that timeout the executor state on the driver side is cleaned 
> up, but on the AM side the state is still maintained until the NM heartbeat 
> timeout. This can introduce some unexpected behaviors, listed after the sketch 
> below:
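> The sketch is hypothetical and scaled down (2 seconds and 10 seconds stand in 
> for the real 120-second RPC ask timeout and the roughly 10-minute NM liveness 
> timeout); it only models the ordering of the two timeouts and is not the actual 
> YarnSchedulerBackend code:
> {code}
> import java.util.concurrent.TimeoutException
> import scala.concurrent.{Await, Future}
> import scala.concurrent.ExecutionContext.Implicits.global
> import scala.concurrent.duration._
>
> object LossReasonTimeoutSketch {
>   // Stand-in for the AM answering the driver's loss-reason query: it can only
>   // reply once the RM reports the container as completed, which for a dead NM
>   // happens only after the NM liveness timeout (~10 minutes, modelled as 10s).
>   def askAmForLossReason(executorId: String): Future[String] = Future {
>     Thread.sleep(10.seconds.toMillis)
>     s"container of executor $executorId was lost because its NM expired"
>   }
>
>   def main(args: Array[String]): Unit = {
>     val executorId = "2"
>     try {
>       // Driver side: the ask uses the RPC ask timeout (120 seconds by default,
>       // modelled as 2 seconds), far shorter than the NM liveness timeout.
>       val reason = Await.result(askAmForLossReason(executorId), 2.seconds)
>       println(s"remove executor $executorId: $reason")
>     } catch {
>       case _: TimeoutException =>
>         // The driver gives up first and cleans up its executor state, while the
>         // AM still counts the container as running until the RM expires the NM.
>         println(s"remove executor $executorId: loss reason unknown (query timed out)")
>     }
>   }
> }
> {code}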
> ---
> * When dynamic allocation is disabled, the executor count on the driver side is 
> less than the count on the AM side during the window between the two timeouts 
> (from 120 seconds to 10 minutes), and it cannot be ramped up to the expected 
> number until the RM detects the NM failure and marks the related containers as 
> completed (a sketch of the AM-side request math follows the example below).
> {quote}
> For example, the target executor number is 10, with 5 NMs (each NM running 2 
> executors). When 1 NM fails, its 2 executors are lost. After the driver-side 
> query times out, the executor count on the driver side is 8, but on the AM side 
> it is still 10, so the AM will not request additional containers until its own 
> count drops to 8 (after 10 minutes).
> {quote}
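> The snippet below is a hypothetical, simplified version of the AM-side "how 
> many containers are missing" math (the variable names only mirror the 
> allocator's request logic); it shows why the AM requests nothing while its 
> running count is stale:
> {code}
> object AllocatorMathSketch {
>   // Containers to request = target - already running - already requested.
>   def missingContainers(targetNumExecutors: Int,
>                         numExecutorsRunning: Int,
>                         numPendingAllocate: Int): Int =
>     targetNumExecutors - numExecutorsRunning - numPendingAllocate
>
>   def main(args: Array[String]): Unit = {
>     // AM's view until the RM expires the dead NM (~10 minutes): still 10 running.
>     println(missingContainers(10, 10, 0)) // 0 -> no replacement containers requested
>     // AM's view once the RM finally reports the 2 containers as completed.
>     println(missingContainers(10, 8, 0))  // 2 -> replacements requested only now
>   }
> }
> {code}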
> ---
> * When dynamic allocation is enabled, the target executor number is maintained 
> on both the driver and the AM side and synced between them. The executor count 
> will be correct on the driver side after the driver's query times out (120 
> seconds), but it remains incorrect on the AM side until the NM failure is 
> detected (10 minutes). In this case the actual executor number is less than 
> the calculated one.
> {quote}
> For example, say the current executor number known to the driver is N and the 
> number known to the AM is M, so M - N is the number of lost executors that the 
> AM still counts as alive.
> When the executor number needs to ramp up to A, the actual number will be 
> A - (M - N).
> When the executor number needs to be brought down to B, the actual number will 
> be max(0, B - (M - N)). When the actual number of executors is 0, the whole 
> system hangs, and it only recovers if the driver requests more resources or 
> after the 10-minute timeout (a worked calculation follows right after this 
> quote).
> This can be reproduced by running the SparkPi example in yarn-client mode with 
> the following configuration:
> spark.dynamicAllocation.enabled    true
> spark.shuffle.service.enabled      true
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 3
> In the middle of the job, kill one NM that is running only executors (not the AM).
> {quote}
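> The following is a worked version of the arithmetic above, with hypothetical 
> numbers matching this reproduction (one NM carrying 2 of the 3 executors dies):
> {code}
> object StaleCountArithmetic {
>   def main(args: Array[String]): Unit = {
>     val n = 1          // executors the driver still believes are alive (N)
>     val m = 3          // executors the AM still counts as running (M)
>     val lost = m - n   // executors gone but still counted by the AM
>
>     val a = 3          // ramp up to A executors
>     println(s"ramp up to $a -> actually running: ${a - lost}")                  // 3 - 2 = 1
>
>     val b = 1          // bring down to B executors
>     println(s"bring down to $b -> actually running: ${math.max(0, b - lost)}")  // max(0, 1 - 2) = 0, job hangs
>   }
> }
> {code}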
> ---
> Possible solutions:
> * Sync the actual executor number from the driver to the AM after the RPC 
> timeout (120 seconds), and also clean up the related state in the AM (a sketch 
> of this idea follows below).
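> The sketch below is only an illustration of that proposed sync; the message and 
> handler names are hypothetical and do not exist in Spark:
> {code}
> // Hypothetical driver -> AM message, sent once the driver's loss-reason query
> // has timed out (120 seconds) and the driver has already removed the executor.
> case class ExecutorAssumedLost(executorId: String)
>
> // Hypothetical AM-side bookkeeping: dropping the executor here lets the next
> // allocation round request a replacement instead of waiting ~10 minutes for
> // the RM to expire the dead NM.
> class AllocatorStateSketch {
>   private var runningExecutors = Set.empty[String]
>
>   def register(executorId: String): Unit =
>     runningExecutors += executorId
>
>   def onExecutorAssumedLost(msg: ExecutorAssumedLost): Unit =
>     runningExecutors -= msg.executorId
>
>   def numExecutorsRunning: Int = runningExecutors.size
> }
> {code}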



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
