[ https://issues.apache.org/jira/browse/SPARK-12516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075693#comment-15075693 ]
Saisai Shao commented on SPARK-12516:
-------------------------------------

Hi [~vanzin], what is your suggestion for this issue? I have failed to figure out a proper solution to address it.

> Properly handle NM failure situation for Spark on Yarn
> ------------------------------------------------------
>
>                 Key: SPARK-12516
>                 URL: https://issues.apache.org/jira/browse/SPARK-12516
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>
> A NodeManager (NM) failure makes all the executors belonging to that NM exit silently.
> In the current implementation of YarnSchedulerBackend, the driver receives an onDisconnected event when an executor is lost and then asks the AM for the loss reason. The AM holds this query connection until the RM reports back the status of the lost container, and then replies to the driver. When an NM fails, the RM does not detect the failure until the NM heartbeat times out (10 minutes by default), so the driver's loss-reason query times out first (120 seconds). After that timeout the executor state on the driver side is cleaned up, but on the AM side it is still maintained until the NM heartbeat timeout. This mismatch can introduce unexpected behaviors:
> ---
> * With dynamic allocation disabled, the executor count on the driver side is lower than on the AM side during the window between the two timeouts (from 120 seconds to 10 minutes), and cannot be ramped back up to the expected number until the RM detects the NM failure and marks the related containers as completed.
> {quote}
> For example, the target executor number is 10, spread over 5 NMs (each NM runs 2 executors). When 1 NM fails, its 2 executors are lost. After the driver-side query times out, the executor count on the driver side is 8, but on the AM side it is still 10, so the AM will not request additional containers until its own count drops to 8 (after 10 minutes).
> {quote}
> ---
> * With dynamic allocation enabled, the target executor number is maintained on both the driver and AM sides and synced between them. The driver-side target becomes correct after the query timeout (120 seconds), but the AM-side number stays wrong until the NM failure is detected (10 minutes). During that window the actual executor number is less than the calculated one.
> {quote}
> For example, let the current target executor number be N on the driver side and M on the AM side, so M - N is the number of lost executors.
> When the executor number needs to ramp up to A, the actual number will be A - (M - N).
> When the executor number needs to come down to B, the actual number will be max(0, B - (M - N)). If the actual number of executors reaches 0, the whole application hangs, and only recovers once the driver requests more resources or the 10-minute timeout fires (a worked sketch of this arithmetic is appended below).
> This can be reproduced by running the SparkPi example in yarn-client mode with the following configuration (a programmatic sketch of this setup is also appended below):
> spark.dynamicAllocation.enabled true
> spark.shuffle.service.enabled true
> spark.dynamicAllocation.minExecutors 1
> spark.dynamicAllocation.initialExecutors 2
> spark.dynamicAllocation.maxExecutors 3
> In the middle of the job, kill an NM that is only running executors.
> {quote}
> ---
> Possible solutions:
> * Sync the actual executor number from the driver to the AM after the RPC timeout (120 seconds), and also clean the related state on the AM side (a rough sketch is appended below).
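To make the count drift above concrete, here is a small self-contained Scala sketch (the object and method names are illustrative, not Spark internals) that computes the actual executor counts from the stale AM-side target M and the corrected driver-side target N:

{code:scala}
// Illustrative only: models the executor-count drift described above,
// not actual Spark internals.
object ExecutorCountDrift {
  // The AM still believes (m - n) already-dead executors are alive, so a
  // ramp-up to `a` effectively yields a - (m - n) running executors.
  def actualAfterRampUp(a: Int, m: Int, n: Int): Int = a - (m - n)

  // A ramp-down to `b` bottoms out at 0, which hangs the application.
  def actualAfterRampDown(b: Int, m: Int, n: Int): Int = math.max(0, b - (m - n))

  def main(args: Array[String]): Unit = {
    val m = 10 // AM-side target (stale)
    val n = 8  // driver-side target (correct after the 120-second timeout)
    println(actualAfterRampUp(12, m, n))  // 10, not the requested 12
    println(actualAfterRampDown(2, m, n)) // 0: the application hangs
  }
}
{code}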
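The reproduction recipe can also be expressed programmatically. A minimal sketch, assuming a reachable YARN cluster with the external shuffle service running on the NMs (the master and configs could equally be passed via spark-submit):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the reproduction recipe from the description: start a small
// dynamic-allocation job in yarn-client mode, then kill a NodeManager that
// is only running executors while the job is in flight.
object NmFailureRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("SPARK-12516 repro")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.initialExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "3")
    val sc = new SparkContext(conf)

    // A SparkPi-style job that runs long enough to kill an NM mid-flight.
    val n = 100000000
    val inside = sc.parallelize(1 to n, 1000).map { _ =>
      val (x, y) = (math.random, math.random)
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println(s"Pi is roughly ${4.0 * inside / n}")
    sc.stop()
  }
}
{code}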
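And a rough sketch of the proposed driver-to-AM sync. SyncExecutorCount and RpcRef are hypothetical placeholders, not existing Spark RPC messages; a real change would hook into YarnSchedulerBackend's loss-reason timeout path and the AM-side allocator bookkeeping:

{code:scala}
// Hypothetical sketch only: SyncExecutorCount and RpcRef are illustrative
// placeholders, not existing Spark RPC messages or APIs.
case class SyncExecutorCount(aliveExecutorIds: Set[String])

trait RpcRef { def send(msg: Any): Unit } // stand-in for an RPC endpoint ref

// Driver side: once the loss-reason query times out (120 seconds), push the
// driver's view of live executors to the AM instead of dropping it silently.
class DriverSide(am: RpcRef) {
  def onLossReasonTimeout(aliveExecutors: Set[String]): Unit =
    am.send(SyncExecutorCount(aliveExecutors))
}

// AM side: discard executors the driver no longer tracks, so the allocator
// can request replacement containers immediately instead of waiting for the
// RM to mark the NM's containers as completed (up to 10 minutes later).
class AmSide(var knownExecutors: Set[String]) {
  def receive(msg: Any): Unit = msg match {
    case SyncExecutorCount(alive) =>
      knownExecutors = knownExecutors.intersect(alive)
    case _ => // other messages handled as before
  }
}
{code}

Sending the set of live executor IDs rather than a bare count would let the AM release exactly the stale container state, which seems safer than reconciling two integers.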