[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686 ]

Mridul Muralidharan commented on SPARK-6050:
--------------------------------------------


Thanks to [~tgraves] for helping investigate this.

There are multiple issues in the codebase - and not all of them have been fully 
understood.

a) For some reason, either YARN returns an incorrect response to an allocate 
request or we are not setting the right parameter.
See the snippet in [1] for details.
(I can't share the logs unfortunately - but Tom has access to them, and it 
should be trivial for others to reproduce the issue).

b) However (a) happens, we do not recover from it.
All subsequent heartbeat requests DO NOT contain pending allocation 
requests (and we have already rejected/de-allocated whatever YARN just sent us 
due to (a)).

To elaborate: updateResourceRequests computes missing == 0 because it relies 
on getNumPendingAllocate() - which DOES NOT do the right thing in our context. 
Note: the 'ask' list in the superclass was cleared as part of the previous 
allocate() call.
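To make the arithmetic concrete, here is a hedged sketch of the missing-count computation (names follow the log line quoted in [1], not the exact YarnAllocator fields), showing why no new requests get added once the stale pending count equals the target:

```scala
// Simplified sketch of the missing-executor arithmetic in
// YarnAllocator.updateResourceRequests. This is an illustration, not the
// exact Spark code; the variable names mirror the log output below.
def missingExecutors(targetNumExecutors: Int,
                     numPendingAllocate: Int,
                     numExecutorsRunning: Int): Int =
  targetNumExecutors - numPendingAllocate - numExecutorsRunning

// Values from the log: target = 1000, pending = 1000 (stale - the containers
// YARN actually sent were released), running = 0.
val missing = missingExecutors(1000, 1000, 0)
// missing == 0, so no ContainerRequests are added to the AMRMClient 'ask'
// list - which was already cleared by the previous allocate() call.
```

With missing stuck at 0 and the 'ask' list empty, every heartbeat is effectively a no-op, which is the non-recovery described in (b).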


Essentially, we were defending against these sorts of corner cases in our code 
earlier - but the move to depend on AMRMClientImpl, and the subsequent changes 
to it from under us, have caused these problems for Spark.


Fixing (a) will mask (b) - but IMO we should address (b) as soon as possible too.




[1] Note the vCores in the response, and the subsequent release of all 
containers.
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, 
each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: 
<memory:38912, vCores:8>)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - 
sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, 
numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current 
executor count: 0. Cluster resources: <memory:43006976, vCores:-1000>.
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that 
were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, 
launching executors on 0 of them.
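The release of all 1000 containers is consistent with AMRMClient-style matching, which looks up outstanding requests keyed on the exact Resource that was asked for. A hedged sketch (the Resource case class and vCores value here are illustrative assumptions, not the actual YARN types or the confirmed response):

```scala
// Hedged sketch: AMRMClientImpl matches an allocated container back to an
// outstanding ContainerRequest via the requested Resource. If YARN hands
// back containers whose Resource does not equal what was requested (e.g.
// the vCores value comes back wrong, as the negative cluster vCores in the
// log suggests), the lookup finds no match and the container is released.
case class Resource(memoryMb: Int, vCores: Int)

def matchesRequest(requested: Resource, allocated: Resource): Boolean =
  requested == allocated  // exact equality, as in the matching-table lookup

val requested = Resource(38912, 8) // what Spark asked for (from the log)
val allocated = Resource(38912, 1) // hypothetical response with wrong vCores

// No match, so the container would be counted as unneeded and released -
// "launching executors on 0 of them."
assert(!matchesRequest(requested, allocated))
```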

> Spark on YARN does not work when --executor-cores is specified
> --------------------------------------------------------------
>
>                 Key: SPARK-6050
>                 URL: https://issues.apache.org/jira/browse/SPARK-6050
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.0
>         Environment: 2.5 based YARN cluster.
>            Reporter: Mridul Muralidharan
>            Priority: Blocker
>
> There are multiple issues here (which I will detail as comments), but to 
> reproduce running the following ALWAYS hangs in our cluster with the 1.3 RC
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi     --master 
> yarn-cluster --executor-cores 8    --num-executors 15     --driver-memory 4g  
>    --executor-memory 2g          --queue webmap     lib/spark-examples*.jar   
>   10



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
