Wow, I hadn't noticed this, but 5 seconds is really long. It's true that it's configurable, but I think we need to provide a decent out-of-the-box experience. For comparison, the MapReduce equivalent is 1 second.
I filed https://issues.apache.org/jira/browse/SPARK-7533 for this.

-Sandy

On Mon, May 11, 2015 at 9:03 AM, Mridul Muralidharan <mri...@gmail.com> wrote:

> For tiny/small clusters (particularly single-tenant), you can set it to a
> lower value. But for anything reasonably large or multi-tenant, the request
> storm can be bad if a large enough number of applications start aggressively
> polling the RM. That is why the interval is configurable.
>
> - Mridul
>
> On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:
>
> > Isn't this issue something that should be improved? Based on the following
> > discussion, there are two places where YARN's heartbeat interval is
> > respected on job start-up, but do we really need to respect it on start-up?
> >
> > On Fri, May 8, 2015 at 12:14 PM Taeyun Kim <taeyun....@innowireless.com> wrote:
> >
> >> I think so.
> >>
> >> In fact, the flow is: allocator.allocateResources() -> sleep ->
> >> allocator.allocateResources() -> sleep …
> >>
> >> But I guess that on the first allocateResources() the allocation is not
> >> fulfilled, so the sleep occurs.
> >>
> >> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> >> *Sent:* Friday, May 08, 2015 4:25 PM
> >> *To:* Taeyun Kim; u...@spark.apache.org
> >> *Subject:* Re: YARN mode startup takes too long (10+ secs)
> >>
> >> So does this sleep occur before allocating resources for the first few
> >> executors to start the job?
> >>
> >> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
> >>
> >> I think I've found the (maybe partial, but major) reason.
> >>
> >> It's between the following lines (it's newly captured, but essentially
> >> the same place that Zoltán Zvara picked):
> >>
> >> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
> >> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753] with ID 1
> >>
> >> When I read the logs on the cluster side, the following lines were found
> >> (the exact time differs from the lines above, but only by the clock
> >> difference between the machines):
> >>
> >> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
> >> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
> >>
> >> It seems that Spark deliberately sleeps for 5 secs.
> >> I've read the Spark source code: in
> >> org.apache.spark.deploy.yarn.ApplicationMaster.scala, launchReporterThread()
> >> has the code for that. It loops, calling allocator.allocateResources() and
> >> then Thread.sleep(). For the sleep, it reads the configuration variable
> >> spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000,
> >> which is 5 secs). According to the comment, "we want to be reasonably
> >> responsive without causing too many requests to RM".
> >> So, unless YARN immediately fulfills the allocation request, it seems that
> >> 5 secs will be wasted.
> >>
> >> When I modified the configuration variable to 1000, it only waited for 1 sec.
> >>
> >> Here are the log lines after the change:
> >>
> >> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 1000
> >> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for : cluster04:45454
> >>
> >> 4 secs saved.
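[Editor's sketch] The loop described above (poll allocateResources(), then sleep for the configured heartbeat interval) can be sketched as follows. This is a simplified stand-in, not the actual Spark source (which is Scala): FakeAllocator and the method names below are hypothetical, simulating an allocation that takes a few polls to fulfill.

```java
// Simplified sketch of the AM reporter loop described in the thread:
// poll the allocator, then sleep for the heartbeat interval. NOT the real
// org.apache.spark.deploy.yarn.ApplicationMaster code; FakeAllocator is a
// hypothetical stand-in for the YARN allocator.
public class ReporterLoopSketch {

    // Pretends the allocation is fulfilled only after `pollsNeeded` polls.
    static final class FakeAllocator {
        int pollsNeeded;
        FakeAllocator(int pollsNeeded) { this.pollsNeeded = pollsNeeded; }
        void allocateResources() { if (pollsNeeded > 0) pollsNeeded--; }
        boolean allocationDone() { return pollsNeeded == 0; }
    }

    // Returns the total time spent sleeping; with the 5000 ms default,
    // this idle wait is where the multi-second startup delay comes from.
    static long runReporter(FakeAllocator allocator, long intervalMs)
            throws InterruptedException {
        long sleptMs = 0;
        allocator.allocateResources();
        while (!allocator.allocationDone()) {
            // "Started progress reporter thread - sleep time : 5000"
            Thread.sleep(intervalMs);
            sleptMs += intervalMs;
            allocator.allocateResources();
        }
        return sleptMs;
    }

    public static void main(String[] args) throws InterruptedException {
        // Allocation fulfilled on the third poll: two sleeps of 10 ms each.
        System.out.println(runReporter(new FakeAllocator(3), 10));
    }
}
```

With a 5000 ms interval, even a single unfulfilled poll costs 5 seconds of wall-clock startup time, which matches the 11:36:32 to 11:36:38 gap in the log above.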
> >>
> >> So, when one does not want to wait 5 secs, one can change
> >> spark.yarn.scheduler.heartbeat.interval-ms.
> >> I hope that the additional overhead it incurs would be negligible.
> >>
> >> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> >> *Sent:* Thursday, May 07, 2015 10:05 PM
> >> *To:* Taeyun Kim; u...@spark.apache.org
> >> *Subject:* Re: YARN mode startup takes too long (10+ secs)
> >>
> >> Without considering everything, just a few hints:
> >> You are running on YARN. From 09:18:34 to 09:18:37 your application is in
> >> state ACCEPTED. There is a noticeable overhead here, introduced by
> >> communicating with YARN's ResourceManager and NodeManager, and because the
> >> YARN scheduler needs time to make a decision. I guess that somewhere from
> >> 09:18:38 to 09:18:43 your application JAR gets copied to another container
> >> requested by the Spark ApplicationMaster deployed in YARN's container 0.
> >> Deploying an executor usually needs further resource negotiations with the
> >> ResourceManager. Also, as I said, your JAR and the Executor's code must be
> >> copied to the container's local directory; execution is blocked until that
> >> is complete.
> >>
> >> On Thu, May 7, 2015 at 3:09 AM Taeyun Kim <taeyun....@innowireless.com> wrote:
> >>
> >> Hi,
> >>
> >> I'm running a Spark application in YARN-client or YARN-cluster mode,
> >> but it seems to take too long to start up.
> >> It takes 10+ seconds to initialize the Spark context.
> >> Is this normal? Or can it be optimized?
> >>
> >> The environment is as follows:
> >> - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
> >> - Spark: 1.3.1
> >> - Client: Windows 7, but similar result on CentOS 6.6
> >>
> >> The following is the startup part of the application log.
> >> (Some private information was edited.)
> >> 'Main: Initializing context' on the first line and 'MainProcessor:
> >> Deleting previous output files' on the last line are logged by the
> >> application; everything in between is from Spark itself. The application
> >> logic is executed after this log is displayed.
> >>
> >> ---
> >>
> >> 15/05/07 09:18:31 INFO Main: Initializing context
> >> 15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
> >> 15/05/07 09:18:31 INFO SecurityManager: Changing view acls to: myuser,myapp
> >> 15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to: myuser,myapp
> >> 15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
> >> 15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
> >> 15/05/07 09:18:31 INFO Remoting: Starting remoting
> >> 15/05/07 09:18:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@mymachine:54449]
> >> 15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver' on port 54449.
> >> 15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
> >> 15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
> >> 15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
> >> 15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity 1956.7 MB
> >> 15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
> >> 15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
> >> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> >> 15/05/07 09:18:32 INFO AbstractConnector: Started SocketConnector@0.0.0.0:54450
> >> 15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file server' on port 54450.
> >> 15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
> >> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> >> 15/05/07 09:18:32 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
> >> 15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on port 4040.
> >> 15/05/07 09:18:32 INFO SparkUI: Started SparkUI at http://mymachine:4040
> >> 15/05/07 09:18:32 INFO SparkContext: Added JAR file:/D:/Projects/MyApp/MyApp.jar at http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
> >> 15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at cluster01/10.111.111.11:8050
> >> 15/05/07 09:18:32 INFO Client: Requesting a new application from cluster with 3 NodeManagers
> >> 15/05/07 09:18:32 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (23040 MB per container)
> >> 15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
> >> 15/05/07 09:18:32 INFO Client: Setting up container launch context for our AM
> >> 15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
> >> 15/05/07 09:18:32 INFO Client: Source and destination file systems are the same.
> >> Not copying hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
> >> 15/05/07 09:18:32 INFO Client: Setting up the launch environment for our AM container
> >> 15/05/07 09:18:33 INFO SecurityManager: Changing view acls to: myuser,myapp
> >> 15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to: myuser,myapp
> >> 15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(myuser, myapp); users with modify permissions: Set(myuser, myapp)
> >> 15/05/07 09:18:33 INFO Client: Submitting application 2 to ResourceManager
> >> 15/05/07 09:18:33 INFO YarnClientImpl: Submitted application application_1430956687773_0002
> >> 15/05/07 09:18:34 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> >> 15/05/07 09:18:34 INFO Client:
> >>      client token: N/A
> >>      diagnostics: N/A
> >>      ApplicationMaster host: N/A
> >>      ApplicationMaster RPC port: -1
> >>      queue: default
> >>      start time: 1430957906540
> >>      final status: UNDEFINED
> >>      tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
> >>      user: myapp
> >> 15/05/07 09:18:35 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> >> 15/05/07 09:18:36 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> >> 15/05/07 09:18:37 INFO Client: Application report for application_1430956687773_0002 (state: ACCEPTED)
> >> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@cluster02:39698/user/YarnAM#-1579648782]
> >> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter.
> >> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> cluster01, PROXY_URI_BASES -> http://cluster01:8088/proxy/application_1430956687773_0002), /proxy/application_1430956687773_0002
> >> 15/05/07 09:18:37 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> >> 15/05/07 09:18:38 INFO Client: Application report for application_1430956687773_0002 (state: RUNNING)
> >> 15/05/07 09:18:38 INFO Client:
> >>      client token: N/A
> >>      diagnostics: N/A
> >>      ApplicationMaster host: cluster02
> >>      ApplicationMaster RPC port: 0
> >>      queue: default
> >>      start time: 1430957906540
> >>      final status: UNDEFINED
> >>      tracking URL: http://cluster01:8088/proxy/application_1430956687773_0002/
> >>      user: myapp
> >> 15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application application_1430956687773_0002 has started running.
> >> 15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on 54491
> >> 15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register BlockManager
> >> 15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block manager mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine, 54491)
> >> 15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
> >> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979] with ID 1
> >> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> >> 15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
> >>
> >> Thanks.
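[Editor's sketch] To make the tuning knob discussed in this thread concrete: below is a minimal, hypothetical illustration of how the interval resolves. The property name and its 5000 ms default come from the thread above; the plain Map stands in for Spark's configuration object, which is not available here.

```java
import java.util.Map;

// Hedged sketch: spark.yarn.scheduler.heartbeat.interval-ms and its 5000 ms
// default are taken from the thread; a plain Map stands in for SparkConf.
public class HeartbeatIntervalSketch {
    static final String KEY = "spark.yarn.scheduler.heartbeat.interval-ms";

    // Resolve the sleep interval the AM reporter thread would use.
    static long intervalMs(Map<String, String> conf) {
        String v = conf.get(KEY);
        return v != null ? Long.parseLong(v) : 5000L;
    }

    public static void main(String[] args) {
        System.out.println(intervalMs(Map.of()));            // out of the box: 5000
        System.out.println(intervalMs(Map.of(KEY, "1000"))); // tuned down as in the thread: 1000
    }
}
```

On a real cluster the value would be supplied to the application at submit time (for example via a --conf argument to spark-submit); as Mridul notes above, very aggressive values can create a request storm against the RM on large or multi-tenant clusters, so the trade-off matters.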