[
https://issues.apache.org/jira/browse/SPARK-13181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prabhu Joseph updated SPARK-13181:
----------------------------------
Attachment: ran3.JPG
> Spark delay in task scheduling within executor
> ----------------------------------------------
>
> Key: SPARK-13181
> URL: https://issues.apache.org/jira/browse/SPARK-13181
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Reporter: Prabhu Joseph
> Fix For: 1.5.2
>
> Attachments: ran3.JPG
>
>
> When a Spark job has some RDD partitions cached in memory and some that must
> be read from Hadoop, the tasks within an executor that read from memory start
> in parallel, but the task that reads from Hadoop starts only after a delay.
> Repro:
> A log file of 1.25 GB is given as input (5 partitions of 256 MB each).
> val logData = sc.textFile(logFile, 2).cache()
> var numAs = logData.filter(line => line.contains("a")).count()
> var numBs = logData.filter(line => line.contains("b")).count()
> Run the Spark job with 1 executor with 6 GB memory and 12 cores.
> Stage A (counting lines containing "a"): the executor starts 5 tasks in
> parallel, and all of them read data from Hadoop.
> Stage B (counting lines containing "b"): the data is now cached (4 partitions
> are in memory, 1 is in Hadoop), so the executor starts 4 tasks in parallel
> and, after a 4-second delay, starts the last task, which reads from Hadoop.
> When the same job is run with 12 GB memory, all 5 partitions are in memory
> and the 5 tasks in Stage B start in parallel.
> When the job is run with 2 GB memory, all 5 partitions are in Hadoop and the
> 5 tasks in Stage B start in parallel.
> The task delay occurs only when some partitions are in memory and some are
> in Hadoop.
> See the attached image.
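
A possible way to narrow this down, offered as a sketch rather than a confirmed
diagnosis: the pattern above (a delay only when cached and uncached partitions
are mixed) looks consistent with Spark's delay scheduling, where the scheduler
waits for spark.locality.wait (default 3s) before launching a task at a less
local level. The standalone version of the repro below sets that wait to 0s to
check whether the delay disappears. The object name and the use of args(0) for
the input path are assumptions for illustration and are not part of the
original report.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper around the repro from the report.
object LocalityDelayRepro {
  def main(args: Array[String]): Unit = {
    // Assumption: the path to the 1.25 GB log file is passed as the first argument.
    val logFile = args(0)

    val conf = new SparkConf()
      .setAppName("SPARK-13181 repro")
      // Assumption under test: the delay is caused by the locality wait.
      // "0s" disables the wait; remove this line to get the default (3s).
      .set("spark.locality.wait", "0s")
    val sc = new SparkContext(conf)

    // Same operations as in the report: cache the file, then run two counts.
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count() // Stage A
    val numBs = logData.filter(line => line.contains("b")).count() // Stage B

    println(s"Lines with a: $numAs, lines with b: $numBs")
    sc.stop()
  }
}
```

If the last task in Stage B still starts late with the wait set to 0s, the
locality wait is probably not the cause and the delay needs another explanation.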