[ https://issues.apache.org/jira/browse/SPARK-13181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132115#comment-15132115 ]
Prabhu Joseph commented on SPARK-13181:
---------------------------------------

Okay, the reason for the task delay within the executor is that some RDDs are in memory and some are in Hadoop, i.e. the tasks have multiple locality levels (NODE_LOCAL and ANY). In this case the scheduler waits for spark.locality.wait, which defaults to 3 seconds. During this period the scheduler waits to launch a data-local task before giving up and launching it on a less-local node. After setting it to 0, all tasks started in parallel. However, I learned that it is better not to reduce it to 0.

> Spark delay in task scheduling within executor
> ----------------------------------------------
>
>                 Key: SPARK-13181
>                 URL: https://issues.apache.org/jira/browse/SPARK-13181
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Prabhu Joseph
>             Fix For: 1.5.2
>
>         Attachments: ran3.JPG
>
>
> When a Spark job has some RDDs in memory and some in Hadoop, the tasks within
> the executor that read from memory are started in parallel, but the task that
> reads from Hadoop is started after some delay.
> Repro:
> A logFile of 1.25 GB is given as input (5 RDDs of 256 MB each).
> val logData = sc.textFile(logFile, 2).cache()
> var numAs = logData.filter(line => line.contains("a")).count()
> var numBs = logData.filter(line => line.contains("b")).count()
> Run the Spark job with 1 executor with 6 GB memory and 12 cores.
> Stage A (reading lines with a) - the executor starts 5 tasks in parallel and
> all read data from Hadoop.
> Stage B (reading lines with b) - as the data is cached (4 RDDs are in memory,
> 1 is in Hadoop), the executor starts 4 tasks in parallel and, after a
> 4-second delay, starts the last task to read from Hadoop.
> On running the same Spark job with 12 GB memory, all 5 RDDs are in memory and
> the 5 tasks in Stage B started in parallel.
> On running the job with 2 GB memory, all 5 RDDs are in Hadoop and the 5 tasks
> in Stage B started in parallel.
> The task delay happens only when some RDDs are in memory and some are in
> Hadoop.
> Check the attached image.
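
For reference, a minimal sketch of how the spark.locality.wait setting mentioned in the comment could be lowered for the repro job without dropping it to 0. The object name, the placeholder input path, and the 1s value are assumptions for illustration, not values taken from the issue; the sketch also assumes the job is launched through spark-submit, which supplies the master URL.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; the issue only shows the three repro lines.
object LocalityWaitExample {
  def main(args: Array[String]): Unit = {
    // Placeholder path (assumption): pass the real log file as the first argument.
    val logFile = if (args.nonEmpty) args(0) else "hdfs:///path/to/logfile"

    val conf = new SparkConf()
      .setAppName("LocalityWaitExample")
      // Assumed value for illustration: shorter than the 3s default,
      // but not 0, so the scheduler still briefly prefers data-local slots.
      .set("spark.locality.wait", "1s")

    val sc = new SparkContext(conf)
    try {
      // Same repro as in the issue description.
      val logData = sc.textFile(logFile, 2).cache()
      val numAs = logData.filter(line => line.contains("a")).count()
      val numBs = logData.filter(line => line.contains("b")).count()
      println(s"Lines with a: $numAs, lines with b: $numBs")
    } finally {
      sc.stop()
    }
  }
}
```

The same property can also be passed at launch time with `--conf spark.locality.wait=1s` on spark-submit, which avoids recompiling the job while trying different values.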