@Sab Thank you for your reply, but the cluster has 6 nodes with 300 cores in total, and the Spark application did not request resources from YARN.
@SaiSai I have run it successfully with "spark.dynamicAllocation.initialExecutors" set to 50, but http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation says that "spark.dynamicAllocation.initialExecutors" defaults to "spark.dynamicAllocation.minExecutors". So I think something is wrong here, isn't it? Thanks. (Two illustrative sketches, one of the configuration and one of the addExecutors guard, are appended after the quoted thread below.)

2015-11-24 16:47 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:

> Did you set this configuration "spark.dynamicAllocation.initialExecutors"?
>
> You can set spark.dynamicAllocation.initialExecutors to 50 and try again.
>
> I guess you might be hitting this issue since you're running 1.5.0,
> https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
> explain why 49 executors worked.
>
> On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> If yarn has only 50 cores then it can support at most 49 executors plus 1
>> driver application master.
>>
>> Regards
>> Sab
>> On 24-Nov-2015 1:58 pm, "谢廷稳" <xieting...@gmail.com> wrote:
>>
>>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>>
>>> I have run it again; the command is:
>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster
>>> --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>>
>>>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
>>>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1447834709734_0120_000001
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization
>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
>>>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@X.X.X.X]
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 61904.
>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with capacity 1966.1 MB
>>>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>>>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14692
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 14692.
>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:15948
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on port 15948.
>>>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
>>>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>>>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 41830
>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register BlockManager
>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 41830)
>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
>>>> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>>>> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>>>> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at X.X.X.X
>>>> 15/11/24 16:16:00 INFO yarn.YarnRMClient: Registering the ApplicationMaster
>>>> 15/11/24 16:16:00 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>> 15/11/24 16:16:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
>>>> 15/11/24 16:16:29 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
>>>> 15/11/24 16:16:29 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:36
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:36) with 200 output partitions
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:36)
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Missing parents: List()
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32), which has no missing parents
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=2061647216
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 1966.1 MB)
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=2061647216
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 1966.1 MB)
>>>> 15/11/24 16:16:30 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on X.X.X.X:41830 (size: 1202.0 B, free: 1966.1 MB)
>>>> 15/11/24 16:16:30 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
>>>> 15/11/24 16:16:30 INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32)
>>>> 15/11/24 16:16:30 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
>>>> 15/11/24 16:16:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:15 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:30 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:18:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>
>>> 2015-11-24 15:14 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>
>>>> What about this configuration in Yarn: "yarn.scheduler.maximum-allocation-mb"?
>>>>
>>>> I'm curious why 49 executors worked but 50 failed. Would you provide
>>>> your application master log? If a container request is issued, there
>>>> will be log lines like:
>>>>
>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>>
>>>> On Tue, Nov 24, 2015 at 2:56 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>
>>>>> OK, the YARN conf is as follows:
>>>>>
>>>>> yarn.nodemanager.resource.memory-mb: 115200
>>>>> yarn.nodemanager.resource.cpu-vcores: 50
>>>>>
>>>>> I think the YARN resources are sufficient. As I said in the previous
>>>>> mail, I think the Spark application didn't request resources from YARN.
>>>>>
>>>>> Thanks
>>>>>
>>>>> 2015-11-24 14:30 GMT+08:00 cherrywayb...@gmail.com <cherrywayb...@gmail.com>:
>>>>>
>>>>>> Can you show your parameter values in your env?
>>>>>> yarn.nodemanager.resource.cpu-vcores
>>>>>> yarn.nodemanager.resource.memory-mb
>>>>>>
>>>>>> ------------------------------
>>>>>> cherrywayb...@gmail.com
>>>>>>
>>>>>> *From:* 谢廷稳 <xieting...@gmail.com>
>>>>>> *Date:* 2015-11-24 12:13
>>>>>> *To:* Saisai Shao <sai.sai.s...@gmail.com>
>>>>>> *CC:* spark users <user@spark.apache.org>
>>>>>> *Subject:* Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation
>>>>>> OK, the YARN cluster is used only by me; it has 6 nodes which can run
>>>>>> over 100 executors, and the YARN RM logs showed that the Spark application
>>>>>> did not request resources from it.
>>>>>>
>>>>>> Is this a bug? Should I create a JIRA for this problem?
>>>>>>
>>>>>> 2015-11-24 12:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>
>>>>>>> OK, so this looks like your Yarn cluster does not allocate the
>>>>>>> containers, which you expect to be 50. Does the Yarn cluster have
>>>>>>> enough resources left after allocating the AM container? If not,
>>>>>>> that is the problem.
>>>>>>>
>>>>>>> From my reading of your description, the problem does not lie in
>>>>>>> dynamic allocation. As I said, it works for me with min and max
>>>>>>> executors set to the same number.
>>>>>>>
>>>>>>> On Tue, Nov 24, 2015 at 11:54 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Saisai,
>>>>>>>> I'm sorry I did not describe it clearly. The YARN debug log said I
>>>>>>>> have 50 executors, but the ResourceManager showed that I only have
>>>>>>>> 1 container, for the AppMaster.
>>>>>>>>
>>>>>>>> I have checked the YARN RM logs: after the AppMaster changed state
>>>>>>>> from ACCEPTED to RUNNING, there were no further log entries about
>>>>>>>> this job. So the problem is that I did not have any executors, but
>>>>>>>> ExecutorAllocationManager thinks I do. Would you mind running a
>>>>>>>> test in your cluster environment?
>>>>>>>> Thanks,
>>>>>>>> Weber
>>>>>>>>
>>>>>>>> 2015-11-24 11:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>
>>>>>>>>> I think this behavior is expected, since you already have 50
>>>>>>>>> executors launched, so there is no need to acquire additional
>>>>>>>>> executors. Your change is not solid; it just hides the log message.
>>>>>>>>>
>>>>>>>>> Again, I think you should check the logs of Yarn and Spark to see
>>>>>>>>> if the executors are started correctly. Why are resources still
>>>>>>>>> not enough when you already have 50 executors?
>>>>>>>>>
>>>>>>>>> On Tue, Nov 24, 2015 at 10:48 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi SaiSai,
>>>>>>>>>> I changed "if (numExecutorsTarget >= maxNumExecutors)" to
>>>>>>>>>> "if (numExecutorsTarget > maxNumExecutors)" in the first line of
>>>>>>>>>> ExecutorAllocationManager#addExecutors() and it ran well.
>>>>>>>>>> In my opinion, when I set minExecutors equal to maxExecutors,
>>>>>>>>>> then the first time executors are to be added, numExecutorsTarget
>>>>>>>>>> already equals maxNumExecutors, so it repeatedly prints "DEBUG
>>>>>>>>>> ExecutorAllocationManager: Not adding executors because our
>>>>>>>>>> current target total is already 50 (limit 50)".
>>>>>>>>>> Thanks
>>>>>>>>>> Weber
>>>>>>>>>>
>>>>>>>>>> 2015-11-23 21:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Tingwen,
>>>>>>>>>>>
>>>>>>>>>>> Would you mind sharing your changes in
>>>>>>>>>>> ExecutorAllocationManager#addExecutors()?
>>>>>>>>>>>
>>>>>>>>>>> From my understanding and tests, dynamic allocation works when
>>>>>>>>>>> you set the min and max number of executors to the same number.
>>>>>>>>>>>
>>>>>>>>>>> Please check your Spark and Yarn logs to make sure the executors
>>>>>>>>>>> are started correctly; the warning log means there are currently
>>>>>>>>>>> not enough resources to submit tasks.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Saisai
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 23, 2015 at 8:41 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> I ran SparkPi on YARN with Dynamic Allocation enabled and set
>>>>>>>>>>>> spark.dynamicAllocation.maxExecutors equal to
>>>>>>>>>>>> spark.dynamicAllocation.minExecutors, then I submitted the
>>>>>>>>>>>> application using:
>>>>>>>>>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>>>>>>>> --master yarn-cluster --driver-memory 4g --executor-memory 8g
>>>>>>>>>>>> lib/spark-examples*.jar 200
>>>>>>>>>>>>
>>>>>>>>>>>> The application was submitted successfully, but the AppMaster
>>>>>>>>>>>> keeps saying "15/11/23 20:13:08 WARN cluster.YarnClusterScheduler:
>>>>>>>>>>>> Initial job has not accepted any resources; check your cluster UI
>>>>>>>>>>>> to ensure that workers are registered and have sufficient resources",
>>>>>>>>>>>> and when I turned on DEBUG logging, I found "15/11/23 20:24:00 DEBUG
>>>>>>>>>>>> ExecutorAllocationManager: Not adding executors because our current
>>>>>>>>>>>> target total is already 50 (limit 50)" in the console.
>>>>>>>>>>>>
>>>>>>>>>>>> I have fixed it by modifying the code in
>>>>>>>>>>>> ExecutorAllocationManager.addExecutors. Is this a bug, or is it by
>>>>>>>>>>>> design that we can't set maxExecutors equal to minExecutors?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Weber
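
For reference, a minimal sketch of the dynamic-allocation settings discussed in this thread, expressed as a Scala SparkConf. The property keys are the standard Spark 1.5 configuration names; the object and app names are made up for illustration, and the value 50 simply mirrors the numbers used above.

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch, assuming min == max == 50 as in this thread.
  object DynamicAllocationSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("SparkPi-dynamic-allocation-sketch") // hypothetical app name
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true") // external shuffle service is required for dynamic allocation on YARN
        .set("spark.dynamicAllocation.minExecutors", "50")
        .set("spark.dynamicAllocation.maxExecutors", "50")
        // Explicitly setting the initial count is the workaround discussed above;
        // by default it falls back to spark.dynamicAllocation.minExecutors.
        .set("spark.dynamicAllocation.initialExecutors", "50")
      val sc = new SparkContext(conf)
      // ... run the job, e.g. the SparkPi computation ...
      sc.stop()
    }
  }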
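The DEBUG line quoted above comes from the upper-bound check at the top of ExecutorAllocationManager.addExecutors. Below is a simplified sketch of that guard, not the actual Spark source (the real method keeps this state internally and does more bookkeeping), showing why a target that already equals maxExecutors short-circuits every add request, and where Weber's ">=" to ">" experiment applies.

  // Illustrative sketch only, not Spark's real code.
  object AddExecutorsGuardSketch {
    def addExecutors(numExecutorsTarget: Int, maxNumExecutors: Int): Int = {
      // With minExecutors == maxExecutors, the target already sits at the limit,
      // so this branch is taken on every call and no further request is made.
      if (numExecutorsTarget >= maxNumExecutors) { // Weber's experiment relaxed this to '>'
        println(s"DEBUG ExecutorAllocationManager: Not adding executors because our " +
          s"current target total is already $numExecutorsTarget (limit $maxNumExecutors)")
        return 0
      }
      // ...otherwise compute and request additional executors up to the limit...
      maxNumExecutors - numExecutorsTarget
    }

    def main(args: Array[String]): Unit = {
      // Prints the DEBUG line and returns 0, mirroring the behavior seen in the thread.
      println(addExecutors(numExecutorsTarget = 50, maxNumExecutors = 50))
    }
  }

As Saisai notes above, relaxing the guard mostly hides the log line; the initial containers still have to be requested from YARN at startup, which is consistent with the SPARK-9092 issue he references and with the observation that explicitly setting spark.dynamicAllocation.initialExecutors made the job run.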