@Sab Thank you for your reply, but the cluster has 6 nodes with 300 cores in total, and the Spark application did not request resources from YARN.
@SaiSai I have run it successfully with "spark.dynamicAllocation.initialExecutors" set to 50, but http://spark.apache.org/docs/latest/configuration.html#dynamic-allocation says that "spark.dynamicAllocation.initialExecutors" defaults to "spark.dynamicAllocation.minExecutors". So I think something is wrong here, isn't it? Thanks. (Two illustrative sketches, one of the configuration and one of the addExecutors guard, are appended after the quoted thread below.)

2015-11-24 16:47 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:

> Did you set this configuration "spark.dynamicAllocation.initialExecutors"?
>
> You can set spark.dynamicAllocation.initialExecutors to 50 and try again.
>
> I guess you might be hitting this issue since you're running 1.5.0,
> https://issues.apache.org/jira/browse/SPARK-9092. But it still cannot
> explain why 49 executors worked.
>
> On Tue, Nov 24, 2015 at 4:42 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> If yarn has only 50 cores then it can support at most 49 executors plus 1
>> driver application master.
>>
>> Regards
>> Sab
>> On 24-Nov-2015 1:58 pm, "谢廷稳" <xieting...@gmail.com> wrote:
>>
>>> OK, yarn.scheduler.maximum-allocation-mb is 16384.
>>>
>>> I have run it again; the command is:
>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster
>>> --driver-memory 4g --executor-memory 8g lib/spark-examples*.jar 200
>>>
>>>> 15/11/24 16:15:56 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
>>>> 15/11/24 16:15:57 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1447834709734_0120_000001
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization
>>>> 15/11/24 16:15:58 INFO yarn.ApplicationMaster: Waiting for spark context initialization ...
>>>> 15/11/24 16:15:58 INFO spark.SparkContext: Running Spark version 1.5.0
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing view acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: Changing modify acls to: hdfs-test
>>>> 15/11/24 16:15:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hdfs-test); users with modify permissions: Set(hdfs-test)
>>>> 15/11/24 16:15:58 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>>> 15/11/24 16:15:59 INFO Remoting: Starting remoting
>>>> 15/11/24 16:15:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@X.X.X.X]
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 61904.
>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering MapOutputTracker
>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering BlockManagerMaster
>>>> 15/11/24 16:15:59 INFO storage.DiskBlockManager: Created local directory at /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/blockmgr-33fbe6c4-5138-4eff-83b4-fb0c886667b7
>>>> 15/11/24 16:15:59 INFO storage.MemoryStore: MemoryStore started with capacity 1966.1 MB
>>>> 15/11/24 16:15:59 INFO spark.HttpFileServer: HTTP File server directory is /data1/hadoop/nm-local-dir/usercache/hdfs-test/appcache/application_1447834709734_0120/spark-fbbfa2bd-6d30-421e-a634-4546134b3b5f/httpd-e31d7b8e-ca8f-400e-8b4b-d2993fb6f1d1
>>>> 15/11/24 16:15:59 INFO spark.HttpServer: Starting HTTP Server
>>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:14692
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'HTTP file server' on port 14692.
>>>> 15/11/24 16:15:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>>>> 15/11/24 16:15:59 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
>>>> 15/11/24 16:15:59 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 15/11/24 16:15:59 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:15948
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'SparkUI' on port 15948.
>>>> 15/11/24 16:15:59 INFO ui.SparkUI: Started SparkUI at X.X.X.X
>>>> 15/11/24 16:15:59 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
>>>> 15/11/24 16:15:59 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
>>>> 15/11/24 16:15:59 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41830.
>>>> 15/11/24 16:15:59 INFO netty.NettyBlockTransferService: Server created on 41830
>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Trying to register BlockManager
>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMasterEndpoint: Registering block manager X.X.X.X:41830 with 1966.1 MB RAM, BlockManagerId(driver, 10.12.30.2, 41830)
>>>> 15/11/24 16:15:59 INFO storage.BlockManagerMaster: Registered BlockManager
>>>> 15/11/24 16:16:00 INFO scheduler.EventLoggingListener: Logging events to hdfs:///tmp/latest-spark-events/application_1447834709734_0120_1
>>>> 15/11/24 16:16:00 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#293602859])
>>>> 15/11/24 16:16:00 INFO client.RMProxy: Connecting to ResourceManager at X.X.X.X
>>>> 15/11/24 16:16:00 INFO yarn.YarnRMClient: Registering the ApplicationMaster
>>>> 15/11/24 16:16:00 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
>>>> 15/11/24 16:16:29 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
>>>> 15/11/24 16:16:29 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
>>>> 15/11/24 16:16:29 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:36
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:36) with 200 output partitions
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:36)
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Parents of final stage: List()
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Missing parents: List()
>>>> 15/11/24 16:16:29 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32), which has no missing parents
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=2061647216
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 1966.1 MB)
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=2061647216
>>>> 15/11/24 16:16:30 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 1966.1 MB)
>>>> 15/11/24 16:16:30 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on X.X.X.X:41830 (size: 1202.0 B, free: 1966.1 MB)
>>>> 15/11/24 16:16:30 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
>>>> 15/11/24 16:16:30 INFO scheduler.DAGScheduler: Submitting 200 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:32)
>>>> 15/11/24 16:16:30 INFO cluster.YarnClusterScheduler: Adding task set 0.0 with 200 tasks
>>>> 15/11/24 16:16:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:15 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:30 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:17:45 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>> 15/11/24 16:18:00 WARN cluster.YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
>>>
>>> 2015-11-24 15:14 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>
>>>> What about this configuration in Yarn: "yarn.scheduler.maximum-allocation-mb"?
>>>>
>>>> I'm curious why 49 executors worked but 50 failed. Would you provide
>>>> your application master log? If a container request is issued, there
>>>> will be log lines like:
>>>>
>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>> 15/10/14 17:35:37 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
>>>>
>>>> On Tue, Nov 24, 2015 at 2:56 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>
>>>>> OK, the YARN conf is as follows:
>>>>>
>>>>> yarn.nodemanager.resource.memory-mb: 115200
>>>>> yarn.nodemanager.resource.cpu-vcores: 50
>>>>>
>>>>> I think the YARN resources are sufficient. As I said in the previous
>>>>> mail, I think the Spark application didn't request resources from YARN.
>>>>>
>>>>> Thanks
>>>>>
>>>>> 2015-11-24 14:30 GMT+08:00 cherrywayb...@gmail.com <cherrywayb...@gmail.com>:
>>>>>
>>>>>> Can you show your parameter values in your env?
>>>>>> yarn.nodemanager.resource.cpu-vcores
>>>>>> yarn.nodemanager.resource.memory-mb
>>>>>>
>>>>>> ------------------------------
>>>>>> cherrywayb...@gmail.com
>>>>>>
>>>>>> *From:* 谢廷稳 <xieting...@gmail.com>
>>>>>> *Date:* 2015-11-24 12:13
>>>>>> *To:* Saisai Shao <sai.sai.s...@gmail.com>
>>>>>> *CC:* spark users <user@spark.apache.org>
>>>>>> *Subject:* Re: A Problem About Running Spark 1.5 on YARN with Dynamic Allocation
>>>>>> OK, the YARN cluster is used only by me; it has 6 nodes which can run
>>>>>> over 100 executors, and the YARN RM logs showed that the Spark application
>>>>>> did not request resources from it.
>>>>>>
>>>>>> Is this a bug? Should I create a JIRA for this problem?
>>>>>>
>>>>>> 2015-11-24 12:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>
>>>>>>> OK, so this looks like your Yarn cluster does not allocate the
>>>>>>> containers, which you expect to be 50. Does the Yarn cluster have
>>>>>>> enough resources left after allocating the AM container? If not,
>>>>>>> that is the problem.
>>>>>>>
>>>>>>> From my reading of your description, the problem does not lie in
>>>>>>> dynamic allocation. As I said, it works for me with min and max
>>>>>>> executors set to the same number.
>>>>>>>
>>>>>>> On Tue, Nov 24, 2015 at 11:54 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Saisai,
>>>>>>>> I'm sorry I did not describe it clearly. The YARN debug log said I
>>>>>>>> have 50 executors, but the ResourceManager showed that I only have
>>>>>>>> 1 container, for the AppMaster.
>>>>>>>>
>>>>>>>> I have checked the YARN RM logs: after the AppMaster changed state
>>>>>>>> from ACCEPTED to RUNNING, there were no further log entries about
>>>>>>>> this job. So the problem is that I did not have any executors, but
>>>>>>>> ExecutorAllocationManager thinks I do. Would you mind running a
>>>>>>>> test in your cluster environment?
>>>>>>>> Thanks,
>>>>>>>> Weber
>>>>>>>>
>>>>>>>> 2015-11-24 11:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>
>>>>>>>>> I think this behavior is expected, since you already have 50
>>>>>>>>> executors launched, so there is no need to acquire additional
>>>>>>>>> executors. Your change is not solid; it just hides the log message.
>>>>>>>>>
>>>>>>>>> Again, I think you should check the logs of Yarn and Spark to see
>>>>>>>>> if the executors are started correctly. Why are resources still
>>>>>>>>> not enough when you already have 50 executors?
>>>>>>>>>
>>>>>>>>> On Tue, Nov 24, 2015 at 10:48 AM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi SaiSai,
>>>>>>>>>> I changed "if (numExecutorsTarget >= maxNumExecutors)" to
>>>>>>>>>> "if (numExecutorsTarget > maxNumExecutors)" in the first line of
>>>>>>>>>> ExecutorAllocationManager#addExecutors() and it ran well.
>>>>>>>>>> In my opinion, when I set minExecutors equal to maxExecutors,
>>>>>>>>>> then the first time executors are to be added, numExecutorsTarget
>>>>>>>>>> already equals maxNumExecutors, so it repeatedly prints "DEBUG
>>>>>>>>>> ExecutorAllocationManager: Not adding executors because our
>>>>>>>>>> current target total is already 50 (limit 50)".
>>>>>>>>>> Thanks
>>>>>>>>>> Weber
>>>>>>>>>>
>>>>>>>>>> 2015-11-23 21:00 GMT+08:00 Saisai Shao <sai.sai.s...@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> Hi Tingwen,
>>>>>>>>>>>
>>>>>>>>>>> Would you mind sharing your changes in
>>>>>>>>>>> ExecutorAllocationManager#addExecutors()?
>>>>>>>>>>>
>>>>>>>>>>> From my understanding and tests, dynamic allocation works when
>>>>>>>>>>> you set the min and max number of executors to the same number.
>>>>>>>>>>>
>>>>>>>>>>> Please check your Spark and Yarn logs to make sure the executors
>>>>>>>>>>> are started correctly; the warning log means there are currently
>>>>>>>>>>> not enough resources to submit tasks.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Saisai
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Nov 23, 2015 at 8:41 PM, 谢廷稳 <xieting...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> I ran SparkPi on YARN with Dynamic Allocation enabled and set
>>>>>>>>>>>> spark.dynamicAllocation.maxExecutors equal to
>>>>>>>>>>>> spark.dynamicAllocation.minExecutors, then I submitted the
>>>>>>>>>>>> application using:
>>>>>>>>>>>> ./bin/spark-submit --class org.apache.spark.examples.SparkPi
>>>>>>>>>>>> --master yarn-cluster --driver-memory 4g --executor-memory 8g
>>>>>>>>>>>> lib/spark-examples*.jar 200
>>>>>>>>>>>>
>>>>>>>>>>>> The application was submitted successfully, but the AppMaster
>>>>>>>>>>>> keeps saying "15/11/23 20:13:08 WARN cluster.YarnClusterScheduler:
>>>>>>>>>>>> Initial job has not accepted any resources; check your cluster UI
>>>>>>>>>>>> to ensure that workers are registered and have sufficient resources",
>>>>>>>>>>>> and when I turned on DEBUG logging, I found "15/11/23 20:24:00 DEBUG
>>>>>>>>>>>> ExecutorAllocationManager: Not adding executors because our current
>>>>>>>>>>>> target total is already 50 (limit 50)" in the console.
>>>>>>>>>>>>
>>>>>>>>>>>> I have fixed it by modifying the code in
>>>>>>>>>>>> ExecutorAllocationManager.addExecutors. Is this a bug, or is it by
>>>>>>>>>>>> design that we can't set maxExecutors equal to minExecutors?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Weber
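
For reference, a minimal sketch of the dynamic-allocation settings discussed in this thread, expressed as a Scala SparkConf. The property keys are the standard Spark 1.5 configuration names; the object and app names are made up for illustration, and the value 50 simply mirrors the numbers used above.

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch, assuming min == max == 50 as in this thread.
  object DynamicAllocationSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("SparkPi-dynamic-allocation-sketch") // hypothetical app name
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true") // external shuffle service is required for dynamic allocation on YARN
        .set("spark.dynamicAllocation.minExecutors", "50")
        .set("spark.dynamicAllocation.maxExecutors", "50")
        // Explicitly setting the initial count is the workaround discussed above;
        // by default it falls back to spark.dynamicAllocation.minExecutors.
        .set("spark.dynamicAllocation.initialExecutors", "50")
      val sc = new SparkContext(conf)
      // ... run the job, e.g. the SparkPi computation ...
      sc.stop()
    }
  }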
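The DEBUG line quoted above comes from the upper-bound check at the top of ExecutorAllocationManager.addExecutors. Below is a simplified sketch of that guard, not the actual Spark source (the real method keeps this state internally and does more bookkeeping), showing why a target that already equals maxExecutors short-circuits every add request, and where Weber's ">=" to ">" experiment applies.

  // Illustrative sketch only, not Spark's real code.
  object AddExecutorsGuardSketch {
    def addExecutors(numExecutorsTarget: Int, maxNumExecutors: Int): Int = {
      // With minExecutors == maxExecutors, the target already sits at the limit,
      // so this branch is taken on every call and no further request is made.
      if (numExecutorsTarget >= maxNumExecutors) { // Weber's experiment relaxed this to '>'
        println(s"DEBUG ExecutorAllocationManager: Not adding executors because our " +
          s"current target total is already $numExecutorsTarget (limit $maxNumExecutors)")
        return 0
      }
      // ...otherwise compute and request additional executors up to the limit...
      maxNumExecutors - numExecutorsTarget
    }

    def main(args: Array[String]): Unit = {
      // Prints the DEBUG line and returns 0, mirroring the behavior seen in the thread.
      println(addExecutors(numExecutorsTarget = 50, maxNumExecutors = 50))
    }
  }

As Saisai notes above, relaxing the guard mostly hides the log line; the initial containers still have to be requested from YARN at startup, which is consistent with the SPARK-9092 issue he references and with the observation that explicitly setting spark.dynamicAllocation.initialExecutors made the job run.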