Wow, I hadn't noticed this, but 5 seconds is really long.  It's true that
it's configurable, but I think we need to provide a decent out-of-the-box
experience.  For comparison, the MapReduce equivalent is 1 second.

I filed https://issues.apache.org/jira/browse/SPARK-7533 for this.

-Sandy

On Mon, May 11, 2015 at 9:03 AM, Mridul Muralidharan <mri...@gmail.com>
wrote:

> For tiny/small clusters (particularly single-tenant), you can set it to a
> lower value.
> But for anything reasonably large or multi-tenant, the request storm can
> be bad if a large enough number of applications start aggressively
> polling the RM.
> That is why the interval is configurable.
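>
> A minimal sketch of lowering it per application (hypothetical app code;
> the key name and its 5000 ms default come from Taeyun's mail below):
>
>   import org.apache.spark.{SparkConf, SparkContext}
>
>   val conf = new SparkConf()
>     .setAppName("fast-startup-app") // hypothetical name
>     // Poll the RM every 1 s instead of every 5 s. Fine on a small or
>     // single-tenant cluster; on a busy multi-tenant one this request
>     // rate is multiplied by every concurrently running application.
>     .set("spark.yarn.scheduler.heartbeat.interval-ms", "1000")
>   val sc = new SparkContext(conf)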
>
> - Mridul
>
>
> On Mon, May 11, 2015 at 6:54 AM, Zoltán Zvara <zoltan.zv...@gmail.com>
> wrote:
> > Isn't this issue something that should be improved? Based on the
> > following discussion, there are two places where YARN's heartbeat
> > interval is respected on job start-up, but do we really need to respect
> > it on start-up?
> >
> > On Fri, May 8, 2015 at 12:14 PM Taeyun Kim <taeyun....@innowireless.com>
> > wrote:
> >
> >> I think so.
> >>
> >> In fact, the flow is: allocator.allocateResources() -> sleep ->
> >> allocator.allocateResources() -> sleep …
> >>
> >> But I guess that on the first allocateResources() call the allocation is
> >> not yet fulfilled, so the sleep occurs.
> >>
> >>
> >>
> >> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> >> *Sent:* Friday, May 08, 2015 4:25 PM
> >>
> >>
> >> *To:* Taeyun Kim; u...@spark.apache.org
> >> *Subject:* Re: YARN mode startup takes too long (10+ secs)
> >>
> >>
> >>
> >> So does this sleep occur before resources are allocated for the first
> >> few executors that start the job?
> >>
> >>
> >>
> >> On Fri, May 8, 2015 at 6:23 AM Taeyun Kim <taeyun....@innowireless.com>
> >> wrote:
> >>
> >> I think I’ve found the (maybe partial, but major) reason.
> >>
> >>
> >>
> >> It’s between the following lines (newly captured, but essentially the
> >> same place that Zoltán Zvara picked):
> >>
> >>
> >>
> >> 15/05/08 11:36:32 INFO BlockManagerMaster: Registered BlockManager
> >>
> >> 15/05/08 11:36:38 INFO YarnClientSchedulerBackend: Registered executor:
> >> Actor[akka.tcp://sparkExecutor@cluster04:55237/user/Executor#-149550753]
> >> with ID 1
> >>
> >>
> >>
> >> When I read the logs on the cluster side, I found the following lines
> >> (the exact time differs from the lines above, but only by the clock
> >> difference between the machines):
> >>
> >>
> >>
> >> 15/05/08 11:36:23 INFO yarn.ApplicationMaster: Started progress reporter
> >> thread - sleep time : 5000
> >>
> >> 15/05/08 11:36:28 INFO impl.AMRMClientImpl: Received new token for :
> >> cluster04:45454
> >>
> >>
> >>
> >> It seemed that Spark deliberately sleeps for 5 secs.
> >>
> >> I’ve read the Spark source code, and in
> >> org.apache.spark.deploy.yarn.ApplicationMaster.scala,
> >> launchReporterThread() has the code for that.
> >>
> >> It loops calling allocator.allocateResources() and Thread.sleep().
> >>
> >> For sleep, it reads the configuration variable
> >> spark.yarn.scheduler.heartbeat.interval-ms (the default value is 5000,
> >> which is 5 secs).
> >>
> >> According to the comment, “we want to be reasonably responsive without
> >> causing too many requests to RM”.
> >>
> >> So, unless YARN immediately fulfills the allocation request, it seems
> >> that 5 secs will be wasted.
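> >>
> >> A minimal sketch of that loop (paraphrased, not the exact Spark source;
> >> variable names are approximate):
> >>
> >>   // Heartbeat interval read from the configuration; default 5000 ms.
> >>   val interval = sparkConf.getInt(
> >>     "spark.yarn.scheduler.heartbeat.interval-ms", 5000)
> >>   while (!finished) {
> >>     // Ask the RM for containers; returns even if nothing is granted yet.
> >>     allocator.allocateResources()
> >>     // Fixed sleep, so an unfulfilled first request costs a full interval.
> >>     Thread.sleep(interval)
> >>   }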
> >>
> >>
> >>
> >> When I modified the configuration variable to 1000, it only waited for 1
> >> sec.
> >>
> >>
> >>
> >> Here are the log lines after the change:
> >>
> >>
> >>
> >> 15/05/08 11:47:21 INFO yarn.ApplicationMaster: Started progress reporter
> >> thread - sleep time : 1000
> >>
> >> 15/05/08 11:47:22 INFO impl.AMRMClientImpl: Received new token for :
> >> cluster04:45454
> >>
> >>
> >>
> >> 4 secs saved.
> >>
> >> So, when one does not want to wait 5 secs, one can lower
> >> spark.yarn.scheduler.heartbeat.interval-ms.
> >>
> >> I hope that the additional overhead this incurs is negligible.
> >>
> >>
> >>
> >>
> >>
> >> *From:* Zoltán Zvara [mailto:zoltan.zv...@gmail.com]
> >> *Sent:* Thursday, May 07, 2015 10:05 PM
> >> *To:* Taeyun Kim; u...@spark.apache.org
> >> *Subject:* Re: YARN mode startup takes too long (10+ secs)
> >>
> >>
> >>
> >> Without considering everything, just a few hints:
> >>
> >> You are running on YARN. From 09:18:34 to 09:18:37 your application is
> >> in state ACCEPTED. There is noticeable overhead introduced by
> >> communicating with YARN's ResourceManager and NodeManager, and the YARN
> >> scheduler needs time to make a decision. I guess that somewhere from
> >> 09:18:38 to 09:18:43 your application JAR gets copied to another
> >> container, requested by the Spark ApplicationMaster deployed in YARN's
> >> container 0. Deploying an executor usually requires further resource
> >> negotiation with the ResourceManager. Also, as I said, your JAR and the
> >> Executor's code must be copied to the container's local directory -
> >> execution is blocked until that is complete.
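> >>
> >> For what it's worth, a sketch of watching that ACCEPTED phase with the
> >> standard YARN client API (appId is a placeholder for the ApplicationId
> >> returned at submission; the 1 s poll mirrors the once-per-second
> >> "Application report" lines in your log):
> >>
> >>   import org.apache.hadoop.yarn.api.records.YarnApplicationState
> >>   import org.apache.hadoop.yarn.client.api.YarnClient
> >>   import org.apache.hadoop.yarn.conf.YarnConfiguration
> >>
> >>   val yarnClient = YarnClient.createYarnClient()
> >>   yarnClient.init(new YarnConfiguration())
> >>   yarnClient.start()
> >>   // appId: placeholder for the id returned when the app was submitted.
> >>   var state = yarnClient.getApplicationReport(appId).getYarnApplicationState
> >>   while (state == YarnApplicationState.SUBMITTED ||
> >>          state == YarnApplicationState.ACCEPTED) {
> >>     Thread.sleep(1000) // the RM scheduler has not placed the AM yet
> >>     state = yarnClient.getApplicationReport(appId).getYarnApplicationState
> >>   }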
> >>
> >>
> >>
> >> On Thu, May 7, 2015 at 3:09 AM Taeyun Kim <taeyun....@innowireless.com>
> >> wrote:
> >>
> >> Hi,
> >>
> >>
> >>
> >> I’m running a Spark application in YARN-client or YARN-cluster mode.
> >>
> >> But it seems to take too long to start up.
> >>
> >> It takes 10+ seconds to initialize the SparkContext.
> >>
> >> Is this normal? Or can it be optimized?
> >>
> >>
> >>
> >> The environment is as follows:
> >>
> >> - Hadoop: Hortonworks HDP 2.2 (Hadoop 2.6)
> >>
> >> - Spark: 1.3.1
> >>
> >> - Client: Windows 7, but similar result on CentOS 6.6
> >>
> >>
> >>
> >> The following is the startup part of the application log. (Some private
> >> information was edited.)
> >>
> >> ‘Main: Initializing context’ on the first line and ‘MainProcessor:
> >> Deleting previous output files’ on the last line are logged by the
> >> application. Everything in between is from Spark itself. The application
> >> logic is executed after this log is displayed.
> >>
> >>
> >>
> >> ---
> >>
> >>
> >>
> >> 15/05/07 09:18:31 INFO Main: Initializing context
> >>
> >> 15/05/07 09:18:31 INFO SparkContext: Running Spark version 1.3.1
> >>
> >> 15/05/07 09:18:31 INFO SecurityManager: Changing view acls to:
> >> myuser,myapp
> >>
> >> 15/05/07 09:18:31 INFO SecurityManager: Changing modify acls to:
> >> myuser,myapp
> >>
> >> 15/05/07 09:18:31 INFO SecurityManager: SecurityManager: authentication
> >> disabled; ui acls disabled; users with view permissions: Set(myuser,
> >> myapp); users with modify permissions: Set(myuser, myapp)
> >>
> >> 15/05/07 09:18:31 INFO Slf4jLogger: Slf4jLogger started
> >>
> >> 15/05/07 09:18:31 INFO Remoting: Starting remoting
> >>
> >> 15/05/07 09:18:31 INFO Remoting: Remoting started; listening on
> >> addresses
> >> :[akka.tcp://sparkDriver@mymachine:54449]
> >>
> >> 15/05/07 09:18:31 INFO Utils: Successfully started service 'sparkDriver'
> >> on port 54449.
> >>
> >> 15/05/07 09:18:31 INFO SparkEnv: Registering MapOutputTracker
> >>
> >> 15/05/07 09:18:32 INFO SparkEnv: Registering BlockManagerMaster
> >>
> >> 15/05/07 09:18:32 INFO DiskBlockManager: Created local directory at
> >> C:\Users\myuser\AppData\Local\Temp\spark-2d3db9d6-ea78-438e-956f-be9c1dcf3a9d\blockmgr-e9ade223-a4b8-4d9f-b038-efd66adf9772
> >>
> >> 15/05/07 09:18:32 INFO MemoryStore: MemoryStore started with capacity
> >> 1956.7 MB
> >>
> >> 15/05/07 09:18:32 INFO HttpFileServer: HTTP File server directory is
> >> C:\Users\myuser\AppData\Local\Temp\spark-ff40d73b-e8ab-433e-88c4-35da27fb6278\httpd-def9220f-ac3a-4dd2-9ac1-2c593b94b2d9
> >>
> >> 15/05/07 09:18:32 INFO HttpServer: Starting HTTP Server
> >>
> >> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> >>
> >> 15/05/07 09:18:32 INFO AbstractConnector: Started
> >> SocketConnector@0.0.0.0:54450
> >>
> >> 15/05/07 09:18:32 INFO Utils: Successfully started service 'HTTP file
> >> server' on port 54450.
> >>
> >> 15/05/07 09:18:32 INFO SparkEnv: Registering OutputCommitCoordinator
> >>
> >> 15/05/07 09:18:32 INFO Server: jetty-8.y.z-SNAPSHOT
> >>
> >> 15/05/07 09:18:32 INFO AbstractConnector: Started
> >> SelectChannelConnector@0.0.0.0:4040
> >>
> >> 15/05/07 09:18:32 INFO Utils: Successfully started service 'SparkUI' on
> >> port 4040.
> >>
> >> 15/05/07 09:18:32 INFO SparkUI: Started SparkUI at
> >> http://mymachine:4040
> >>
> >> 15/05/07 09:18:32 INFO SparkContext: Added JAR
> >> file:/D:/Projects/MyApp/MyApp.jar at
> >> http://10.111.111.199:54450/jars/MyApp.jar with timestamp 1430957912240
> >>
> >> 15/05/07 09:18:32 INFO RMProxy: Connecting to ResourceManager at
> >> cluster01/10.111.111.11:8050
> >>
> >> 15/05/07 09:18:32 INFO Client: Requesting a new application from cluster
> >> with 3 NodeManagers
> >>
> >> 15/05/07 09:18:32 INFO Client: Verifying our application has not
> >> requested
> >> more than the maximum memory capability of the cluster (23040 MB per
> >> container)
> >>
> >> 15/05/07 09:18:32 INFO Client: Will allocate AM container, with 896 MB
> >> memory including 384 MB overhead
> >>
> >> 15/05/07 09:18:32 INFO Client: Setting up container launch context for
> >> our
> >> AM
> >>
> >> 15/05/07 09:18:32 INFO Client: Preparing resources for our AM container
> >>
> >> 15/05/07 09:18:32 INFO Client: Source and destination file systems are
> >> the
> >> same. Not copying
> >> hdfs://cluster01/apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar
> >>
> >> 15/05/07 09:18:32 INFO Client: Setting up the launch environment for our
> >> AM container
> >>
> >> 15/05/07 09:18:33 INFO SecurityManager: Changing view acls to:
> >> myuser,myapp
> >>
> >> 15/05/07 09:18:33 INFO SecurityManager: Changing modify acls to:
> >> myuser,myapp
> >>
> >> 15/05/07 09:18:33 INFO SecurityManager: SecurityManager: authentication
> >> disabled; ui acls disabled; users with view permissions: Set(myuser,
> >> myapp); users with modify permissions: Set(myuser, myapp)
> >>
> >> 15/05/07 09:18:33 INFO Client: Submitting application 2 to
> >> ResourceManager
> >>
> >> 15/05/07 09:18:33 INFO YarnClientImpl: Submitted application
> >> application_1430956687773_0002
> >>
> >> 15/05/07 09:18:34 INFO Client: Application report for
> >> application_1430956687773_0002 (state: ACCEPTED)
> >>
> >> 15/05/07 09:18:34 INFO Client:
> >>
> >>             client token: N/A
> >>
> >>            diagnostics: N/A
> >>
> >>            ApplicationMaster host: N/A
> >>
> >>            ApplicationMaster RPC port: -1
> >>
> >>            queue: default
> >>
> >>            start time: 1430957906540
> >>
> >>            final status: UNDEFINED
> >>
> >>            tracking URL:
> >> http://cluster01:8088/proxy/application_1430956687773_0002/
> >>
> >>            user: myapp
> >>
> >> 15/05/07 09:18:35 INFO Client: Application report for
> >> application_1430956687773_0002 (state: ACCEPTED)
> >>
> >> 15/05/07 09:18:36 INFO Client: Application report for
> >> application_1430956687773_0002 (state: ACCEPTED)
> >>
> >> 15/05/07 09:18:37 INFO Client: Application report for
> >> application_1430956687773_0002 (state: ACCEPTED)
> >>
> >> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: ApplicationMaster
> >> registered as Actor[akka.tcp://sparkYarnAM@cluster02
> >> :39698/user/YarnAM#-1579648782]
> >>
> >> 15/05/07 09:18:37 INFO YarnClientSchedulerBackend: Add WebUI Filter.
> >> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter,
> >> Map(PROXY_HOSTS
> >> -> cluster01, PROXY_URI_BASES ->
> >> http://cluster01:8088/proxy/application_1430956687773_0002),
> >> /proxy/application_1430956687773_0002
> >>
> >> 15/05/07 09:18:37 INFO JettyUtils: Adding filter:
> >> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
> >>
> >> 15/05/07 09:18:38 INFO Client: Application report for
> >> application_1430956687773_0002 (state: RUNNING)
> >>
> >> 15/05/07 09:18:38 INFO Client:
> >>
> >>             client token: N/A
> >>
> >>            diagnostics: N/A
> >>
> >>            ApplicationMaster host: cluster02
> >>
> >>            ApplicationMaster RPC port: 0
> >>
> >>            queue: default
> >>
> >>            start time: 1430957906540
> >>
> >>            final status: UNDEFINED
> >>
> >>            tracking URL:
> >> http://cluster01:8088/proxy/application_1430956687773_0002/
> >>
> >>            user: myapp
> >>
> >> 15/05/07 09:18:38 INFO YarnClientSchedulerBackend: Application
> >> application_1430956687773_0002 has started running.
> >>
> >> 15/05/07 09:18:38 INFO NettyBlockTransferService: Server created on
> >> 54491
> >>
> >> 15/05/07 09:18:38 INFO BlockManagerMaster: Trying to register
> >> BlockManager
> >>
> >> 15/05/07 09:18:38 INFO BlockManagerMasterActor: Registering block
> >> manager
> >> mymachine:54491 with 1956.7 MB RAM, BlockManagerId(<driver>, mymachine,
> >> 54491)
> >>
> >> 15/05/07 09:18:38 INFO BlockManagerMaster: Registered BlockManager
> >>
> >> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: Registered executor:
> >> Actor[akka.tcp://sparkExecutor@cluster02:44996/user/Executor#-786778979]
> >> with ID 1
> >>
> >> 15/05/07 09:18:43 INFO YarnClientSchedulerBackend: SchedulerBackend is
> >> ready for scheduling beginning after reached
> >> minRegisteredResourcesRatio:
> >> 0.8
> >>
> >> 15/05/07 09:18:43 INFO MainProcessor: Deleting previous output files
> >>
> >>
> >>
> >> Thanks.
> >>
>
