[ https://issues.apache.org/jira/browse/SPARK-16533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15456416#comment-15456416 ]

Apache Spark commented on SPARK-16533:
--------------------------------------

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/14925

> Spark application not handling preemption messages
> --------------------------------------------------
>
>                 Key: SPARK-16533
>                 URL: https://issues.apache.org/jira/browse/SPARK-16533
>             Project: Spark
>          Issue Type: Bug
>          Components: EC2, Input/Output, Optimizer, Scheduler, Spark Submit, 
> YARN
>    Affects Versions: 1.6.0
>         Environment: Yarn version: Hadoop 2.7.1-amzn-0
> AWS EMR Cluster running:
> 1 x r3.8xlarge (Master)
> 52 x r3.8xlarge (Core)
> Spark version : 1.6.0
> Scala version: 2.10.5
> Java version: 1.8.0_51
> Input size: ~10 TB
> Input coming from S3
> Queue Configuration:
> Dynamic allocation: enabled
> Preemption: enabled
> Q1: 70% capacity with max of 100%
> Q2: 30% capacity with max of 100%
> Job Configuration:
> Driver memory = 10g
> Executor cores = 6
> Executor memory = 10g
> Deploy mode = cluster
> Master = yarn
> maxResultSize = 4g
> Shuffle manager = hash
>            Reporter: Lucas Winkelmann
>            Assignee: Angus Gerry
>             Fix For: 2.1.0
>
>
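
For reference, the job configuration reported above corresponds roughly to the
following SparkConf sketch in Scala (Spark 1.6). This is only a sketch: the
application name is a placeholder, enabling the external shuffle service is my
assumption (dynamic allocation requires it), and in cluster mode the master,
deploy mode and driver memory would normally be passed to spark-submit rather
than set in code.

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch of job 1's reported settings (job 2 would target queue Q2).
    val conf = new SparkConf()
      .setAppName("job1")                           // placeholder name
      .set("spark.driver.memory", "10g")
      .set("spark.executor.memory", "10g")
      .set("spark.executor.cores", "6")
      .set("spark.driver.maxResultSize", "4g")
      .set("spark.shuffle.manager", "hash")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true") // assumed; needed by dynamic allocation
      .set("spark.yarn.queue", "Q1")
    val sc = new SparkContext(conf)
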
> Here is the scenario:
> I launch job 1 into Q1 and allow it to grow to 100% cluster utilization.
> I wait between 15-30 mins (with 100% of the cluster available this job takes 
> about 1 hr to complete, so job 1 is between 25-50% complete). Note that if I 
> wait less time, the issue sometimes does not occur; it appears to happen only 
> after job 1 is at least 25% complete.
> I launch job 2 into Q2 and preemption occurs on Q1, shrinking job 1 back to 
> 70% of cluster utilization.
> At this point job 1 essentially halts progress while job 2 continues to 
> execute as normal and finishes. Job 1 then either:
> - Fails its attempt and restarts. By the time this attempt fails, the other 
> job is already complete, so the second attempt has the full cluster 
> available and finishes.
> - Remains at its current progress and simply does not finish (I have waited 
> ~6 hrs before finally killing the application).
>  
> Looking in the error log, this warning message repeats constantly:
> WARN NettyRpcEndpointRef: Error sending message [message = 
> RemoveExecutor(454,Container container_1468422920649_0001_01_000594 on host: 
> ip-NUMBERS.ec2.internal was preempted.)] in X attempts
>  
> My observations have led me to believe that the application master does not 
> know about this container being killed and continuously asks the container to 
> remove the executor, until it eventually either fails the attempt or keeps 
> retrying without ever succeeding.
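
The warning quoted above is the generic message produced when a blocking ask
to an RPC endpoint keeps failing and is retried. Purely as an illustration of
that pattern (this is not Spark's actual implementation; askOnce, maxRetries
and the timeout are placeholder names and values), the retry loop behaves
roughly like this: if the endpoint responsible for the preempted container
never answers, every attempt times out and the same warning is logged again.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.util.control.NonFatal

    // Illustrative retry-on-ask loop, not Spark's real code.
    def askWithRetry[T](message: Any,
                        askOnce: Any => Future[T],
                        maxRetries: Int = 3,
                        timeout: FiniteDuration = 120.seconds): T = {
      var lastError: Throwable = new RuntimeException("no attempts made")
      for (attempt <- 1 to maxRetries) {
        try {
          return Await.result(askOnce(message), timeout)
        } catch {
          case NonFatal(e) =>
            lastError = e
            println(s"WARN Error sending message [message = $message] in $attempt attempts")
        }
      }
      // Only after exhausting the retries does the caller see a failure.
      throw lastError
    }
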
>  
> I have done much digging online for anyone else experiencing this issue but 
> have come up with nothing.



