[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456416#comment-15456416 ]

Apache Spark commented on SPARK-16533:
--------------------------------------

User 'vanzin' has created a pull request for this issue:

https://github.com/apache/spark/pull/14925

> Spark application not handling preemption messages
> --------------------------------------------------
>
>                 Key: SPARK-16533
>                 URL: https://issues.apache.org/jira/browse/SPARK-16533
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, YARN
>    Affects Versions: 1.6.0
>         Environment: YARN version: Hadoop 2.7.1-amzn-0
>                      AWS EMR cluster:
>                        1 x r3.8xlarge (master)
>                        52 x r3.8xlarge (core)
>                      Spark version: 1.6.0
>                      Scala version: 2.10.5
>                      Java version: 1.8.0_51
>                      Input size: ~10 TB
>                      Input coming from S3
>                      Queue configuration:
>                        Dynamic allocation: enabled
>                        Preemption: enabled
>                        Q1: 70% capacity with max of 100%
>                        Q2: 30% capacity with max of 100%
>                      Job configuration:
>                        Driver memory = 10g
>                        Executor cores = 6
>                        Executor memory = 10g
>                        Deploy mode = cluster
>                        Master = yarn
>                        maxResultSize = 4g
>                        Shuffle manager = hash
>            Reporter: Lucas Winkelmann
>            Assignee: Angus Gerry
>             Fix For: 2.1.0
>
>
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait 15-30 minutes. (With 100% of the cluster available, job 1 takes about
> 1 hour to complete, so at this point it is 25-50% done.) Note that if I wait
> less time, the issue sometimes does not occur; it appears to happen only once
> job 1 is at least 25% complete.
> I launch job 2 into Q2, and preemption shrinks job 1 back to 70% cluster
> utilization.
> At this point job 1 essentially halts while job 2 continues to execute as
> normal and finishes. Job 1 then either:
> - Fails its attempt and restarts. By the time this attempt fails, job 2 is
> already complete, so the second attempt has the full cluster available and
> finishes.
> - Remains at its current progress and simply never finishes (I have waited
> ~6 hrs before finally killing the application).
>
> Looking into the error log, this message repeats constantly:
> WARN NettyRpcEndpointRef: Error sending message [message =
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host:
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>
> My observations have led me to believe that the application master does not
> know this container has been killed, and keeps resending the RemoveExecutor
> message until it either fails the attempt or retries indefinitely.
>
> I have done much digging online for anyone else experiencing this issue, but
> have come up with nothing.
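
For anyone trying to reproduce the reported setup, below is a minimal sketch
of the job configuration in Scala. The memory, core, shuffle, and dynamic
allocation values come from the report above; the application name and the
choice to set the queue via spark.yarn.queue are illustrative assumptions, and
in cluster deploy mode the driver memory, master, and deploy mode must be
passed to spark-submit rather than set in code.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the reported job configuration. Values marked "report" are taken
// from the issue description; everything else is an assumption.
val conf = new SparkConf()
  .setAppName("preemption-repro")                  // hypothetical name
  .set("spark.executor.cores", "6")                // report
  .set("spark.executor.memory", "10g")             // report
  .set("spark.driver.maxResultSize", "4g")         // report
  .set("spark.shuffle.manager", "hash")            // report (legacy option in 1.6)
  .set("spark.dynamicAllocation.enabled", "true")  // report
  .set("spark.shuffle.service.enabled", "true")    // required by dynamic allocation
  .set("spark.yarn.queue", "Q1")                   // Q1 or Q2, per the report

// Driver memory (10g), master (yarn), and deploy mode (cluster) only take
// effect when passed to spark-submit, not when set after the JVM has started.
val sc = new SparkContext(conf)
{code}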
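The repeating warning is consistent with a bounded-retry send loop: if the
message's target state is already gone because the container was preempted,
every attempt fails the same way. The following is a simplified, self-contained
sketch of that pattern; it is NOT Spark's actual NettyRpcEndpointRef
implementation, just an illustration of how such a loop produces the observed
"in X attempts" warnings.

{code:scala}
import scala.concurrent.duration._
import scala.util.Try

// Simplified sketch of a send-with-retries loop (not Spark's real code).
// 'send' stands in for the underlying RPC transport call.
def sendWithRetries[T](message: Any, maxRetries: Int, retryWait: FiniteDuration)
                      (send: Any => T): T = {
  var attempt = 0
  var lastError: Throwable = null
  while (attempt < maxRetries) {
    attempt += 1
    try {
      return send(message) // succeeds only if the remote endpoint still exists
    } catch {
      case e: Exception =>
        lastError = e
        // A message about a preempted container fails identically on every
        // attempt; the caller only sees the attempt counter climb.
        println(s"WARN Error sending message [message = $message] in $attempt attempts")
        Thread.sleep(retryWait.toMillis)
    }
  }
  throw new RuntimeException(s"Failed to send $message after $maxRetries attempts", lastError)
}

// Hypothetical usage: three attempts, three seconds apart, against an
// endpoint that is already gone, so every attempt fails the same way.
val result = Try {
  sendWithRetries("RemoveExecutor(454, ...)", maxRetries = 3, retryWait = 3.seconds) {
    msg => throw new java.io.IOException(s"connection refused for $msg")
  }
}
{code}

If the executor-loss event never reaches the scheduler through some other
channel, the application appears hung, which would match the behavior the
reporter describes.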