[ 
https://issues.apache.org/jira/browse/SPARK-13181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132115#comment-15132115
 ] 

Prabhu Joseph commented on SPARK-13181:
---------------------------------------

Okay, the reason for the task delay within the executor when some RDD partitions are in 
memory and some are in Hadoop, i.e. tasks at multiple locality levels (NODE_LOCAL and 
ANY): the scheduler waits for spark.locality.wait (3 seconds by default). During this 
period the scheduler tries to launch a data-local task before giving up and launching 
it on a less-local node. After setting it to 0, all tasks started in parallel. But I 
learned that it is better not to reduce it to 0. 
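A minimal sketch of tuning the wait per locality level instead of disabling it 
entirely (the application name and the 500ms values below are only illustrative, 
not a recommendation): 

    import org.apache.spark.{SparkConf, SparkContext}

    // Keep a non-zero locality wait so data-local placement is still preferred,
    // but make it shorter than the 3s default so less-local (ANY) tasks are not
    // held back for long when only part of the RDD is cached.
    val conf = new SparkConf()
      .setAppName("LocalityWaitTuning")          // placeholder application name
      .set("spark.locality.wait", "500ms")       // overall fallback wait
      .set("spark.locality.wait.node", "500ms")  // wait before stepping down from NODE_LOCAL
    val sc = new SparkContext(conf)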

> Spark delay in task scheduling within executor
> ----------------------------------------------
>
>                 Key: SPARK-13181
>                 URL: https://issues.apache.org/jira/browse/SPARK-13181
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Prabhu Joseph
>             Fix For: 1.5.2
>
>         Attachments: ran3.JPG
>
>
> When a Spark job has some RDD partitions in memory and some in Hadoop, the tasks 
> within the executor which read from memory are started in parallel, but the task 
> that reads from Hadoop is started after a delay.
> Repro: 
>     A logFile of 1.25 GB is given as input (5 partitions of 256 MB each). 
>     val logData = sc.textFile(logFile, 2).cache()
>     var numAs = logData.filter(line => line.contains("a")).count()
>     var numBs = logData.filter(line => line.contains("b")).count()
> Run the Spark job with 1 executor with 6GB memory and 12 cores.
> Stage A (counting lines with "a") - the executor starts 5 tasks in parallel and 
> all of them read data from Hadoop.
> Stage B (counting lines with "b") - as the data is now cached (4 partitions in 
> memory, 1 in Hadoop), the executor starts 4 tasks in parallel and, after a delay 
> of about 4 seconds, starts the last task to read from Hadoop.
> On running the same Spark job with 12GB memory, all 5 partitions are in memory and 
> the 5 tasks in Stage B start in parallel. 
> On running the job with 2GB memory, all 5 partitions are in Hadoop and the 5 tasks 
> in Stage B start in parallel. 
> The task delay happens only when some partitions are in memory and some are in Hadoop.
> Check the attached image.
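A self-contained sketch of the repro above, assuming the job is submitted with 1 
executor, 6GB memory and 12 cores as described; the object name and input path are 
placeholders: 

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityRepro {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LocalityRepro")  // placeholder name
        val sc = new SparkContext(conf)

        // ~1.25 GB text file on HDFS (placeholder path); with 6GB of executor
        // memory only part of it stays cached after the first pass.
        val logFile = "hdfs:///tmp/logFile"
        val logData = sc.textFile(logFile, 2).cache()

        // Stage A: fills the cache, all tasks read from Hadoop.
        val numAs = logData.filter(line => line.contains("a")).count()
        // Stage B: mixed locality (some partitions cached, one still on Hadoop);
        // the non-cached task waits out spark.locality.wait before launching.
        val numBs = logData.filter(line => line.contains("b")).count()

        println(s"Lines with a: $numAs, lines with b: $numBs")
        sc.stop()
      }
    }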


