[ https://issues.apache.org/jira/browse/SPARK-6050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14339686#comment-14339686 ]
Mridul Muralidharan commented on SPARK-6050:
--------------------------------------------

Thanks to [~tgraves] for helping investigate this. There are multiple issues in the codebase, and not all of them have been fully understood.

a) For some reason, either YARN returns an incorrect response to an allocate request, or we are not setting the right parameter. See the snippet [1] for details. (I can't share the logs, unfortunately, but Tom has access to them, and it should be trivial for others to reproduce the issue.)

b) For whatever reason (a) happens, we do not recover from it. All subsequent heartbeat requests DO NOT contain pending allocation requests (and we have rejected/de-allocated whatever YARN just sent us due to (a)). To elaborate: updateResourceRequests computes missing == 0 because it relies on getNumPendingAllocate(), which DOES NOT do the right thing in our context. Note: the 'ask' list in the superclass was cleared as part of the previous allocate() call.

Essentially, we were defending against these sorts of corner cases in our code earlier, but the move to depend on AMRMClientImpl, and the subsequent changes to it from under us, have caused these problems for Spark. Fixing (a) will mask (b), but IMO we should address it at the earliest too.

[1] Note the vCores in the response, and the subsequent release of all containers.
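The non-recovery described in (b) can be sketched roughly as follows. This is a simplified, hypothetical model, not Spark's actual YarnAllocator code; the field names simply mirror the values printed in the log below:

```java
// Simplified model of the allocator bookkeeping described in (b).
// Hypothetical sketch; the real logic lives in Spark's YarnAllocator and
// Hadoop's AMRMClientImpl and is more involved.
public class AllocatorModel {
    static final int TARGET_NUM_EXECUTORS = 1000;

    // All containers YARN handed back were released (see (a)), so none run.
    static int numExecutorsRunning = 0;

    // getNumPendingAllocate() still reports the original 1000 asks even
    // though the superclass cleared its 'ask' list after the previous
    // allocate() call, so nothing is actually outstanding on the wire.
    static int numPendingAllocate = 1000;

    // updateResourceRequests derives how many more containers to request.
    static int missing() {
        return TARGET_NUM_EXECUTORS - numPendingAllocate - numExecutorsRunning;
    }

    public static void main(String[] args) {
        // missing == 0, so no new container requests are ever issued and
        // the application hangs with zero executors running.
        System.out.println("missing = " + missing()
            + ", numExecutorsRunning = " + numExecutorsRunning);
    }
}
```

Because the stale pending count keeps missing pinned at 0, every subsequent heartbeat asks for nothing, which matches the repeated "missing = 0, ... numExecutorsRunning = 0" lines in the log.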
15/02/27 01:40:30 INFO YarnAllocator: Will request 1000 executor containers, each with 8 cores and 38912 MB memory including 10240 MB overhead
15/02/27 01:40:30 INFO YarnAllocator: Container request (host: Any, capability: <memory:38912, vCores:8>)
15/02/27 01:40:30 INFO ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/02/27 01:40:30 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:30 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:35 DEBUG ApplicationMaster: Sending progress
15/02/27 01:40:35 INFO YarnAllocator: missing = 0, targetNumExecutors = 1000, numPendingAllocate = 1000, numExecutorsRunning = 0
15/02/27 01:40:36 DEBUG YarnAllocator: Allocated containers: 1000. Current executor count: 0. Cluster resources: <memory:43006976, vCores:-1000>.
15/02/27 01:40:36 DEBUG YarnAllocator: Releasing 1000 unneeded containers that were allocated to us
15/02/27 01:40:36 INFO YarnAllocator: Received 1000 containers from YARN, launching executors on 0 of them.

> Spark on YARN does not work when --executor-cores is specified
> ---------------------------------------------------------
>
>                 Key: SPARK-6050
>                 URL: https://issues.apache.org/jira/browse/SPARK-6050
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.0
>         Environment: 2.5 based YARN cluster.
>            Reporter: Mridul Muralidharan
>            Priority: Blocker
>
> There are multiple issues here (which I will detail as comments), but to
> reproduce: running the following ALWAYS hangs in our cluster with the 1.3 RC:
>
> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --executor-cores 8 --num-executors 15 --driver-memory 4g --executor-memory 2g --queue webmap lib/spark-examples*.jar 10

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
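The "Releasing 1000 unneeded containers" behavior in the log above can be sketched as a matching problem. This is an illustrative, hypothetical sketch (the class and method names are not Spark's): if an allocated container is matched back to an outstanding request by comparing the full (memory, vCores) capability for equality, and the YARN scheduler ignored the vCore part of the ask (e.g. a memory-only resource calculator), then no allocated container matches and all of them are released:

```java
// Hypothetical sketch of strict capability matching; not Spark's actual
// container-matching code.
public class ContainerMatching {
    // The request from the log: <memory:38912, vCores:8>.
    static final int REQUESTED_MEMORY_MB = 38912;
    static final int REQUESTED_VCORES = 8;

    // Strict matching on both dimensions.
    static boolean matchesStrict(int memoryMb, int vCores) {
        return memoryMb == REQUESTED_MEMORY_MB && vCores == REQUESTED_VCORES;
    }

    public static void main(String[] args) {
        // If YARN granted the requested memory but a different vCore count
        // per container, the strict match fails and the container is
        // treated as "unneeded" and released.
        System.out.println(matchesStrict(38912, 1)); // false -> released
        System.out.println(matchesStrict(38912, 8)); // true  -> launched
    }
}
```

Under this reading, the negative vCore total in "Cluster resources: <memory:43006976, vCores:-1000>" is consistent with the granted containers carrying a different vCore capability than the one requested, which would explain why all 1000 were released and zero executors launched.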