[ https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-4961:
-----------------------------
Component/s: Scheduler (was: Spark Core)
> Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted
> processing time
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-4961
> URL: https://issues.apache.org/jira/browse/SPARK-4961
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Reporter: YanTang Zhai
> Priority: Minor
>
> HadoopRDD.getPartitions is evaluated lazily, while DAGScheduler is handling
> the JobSubmitted event. If the input directory is large, getPartitions can
> take a long time. For example, in our cluster it takes anywhere from 0.029s
> to 766.699s. While one JobSubmitted event is being processed, the others
> have to wait. We therefore want to move the HadoopRDD.getPartitions call
> earlier to reduce DAGScheduler.JobSubmitted processing time, so that other
> JobSubmitted events do not have to wait as long. A HadoopRDD object could
> compute its partitions when it is instantiated.
> We could analyse and compare the execution time before and after optimization.
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or
> TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_____]
> (1) The app has only one job
> Before optimization, the execution time of the job is
> [time1__][time2_][time3___][time4_____].
> After optimization, the execution time of the job is
> [time1__][time3___][time2_][time4_____].
> In summary, if the app has only one job, the total execution time is the same
> before and after optimization.
> (2) The app has 4 jobs
> Before optimization,
> job1 execution time is [time2_][time3___][time4_____],
> job2 execution time is [time2__________][time3___][time4_____],
> job3 execution time is................................[time2____][time3___][time4_____],
> job4 execution time is................................[time2_____________][time3___][time4_____].
> After optimization,
> job1 execution time is [time3___][time2_][time4_____],
> job2 execution time is [time3___][time2__][time4_____],
> job3 execution time is................................[time3___][time2_][time4_____],
> job4 execution time is................................[time3___][time2__][time4_____].
> In summary, if the app has multiple jobs, the average execution time per job
> is lower after optimization than before.
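
The comparison above can be checked with a small model. The following is only
a sketch under stated assumptions, not Spark code: the names Job and
job_latencies are hypothetical, the DAGScheduler event loop is modelled as a
single FIFO worker, each job is assumed to be submitted from its own driver
thread (so eager partition listing runs in parallel across jobs), stages are
assumed not to compete for executors, and TaskScheduler.start (time1) is left
out. Unlike the diagram, the sample run submits all four jobs at time 0.

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit_at: float   # when the driver thread triggers the job
    partitions: float  # time3: cost of computing the input partitions
    handle: float      # time2: remaining JobSubmitted handling cost
    stages: float      # time4: stage execution time


def job_latencies(jobs, eager):
    """Return per-job latency (finish time minus submit time), in the
    order the single-threaded event loop processes the jobs."""
    events = []
    for job in jobs:
        if eager:
            # Partitions are computed in the submitting thread,
            # before the JobSubmitted event reaches the event loop.
            posted_at = job.submit_at + job.partitions
            loop_cost = job.handle
        else:
            # Partitions are computed lazily inside the event loop,
            # so every queued job waits behind this cost as well.
            posted_at = job.submit_at
            loop_cost = job.handle + job.partitions
        events.append((posted_at, loop_cost, job))

    loop_free_at = 0.0
    latencies = []
    # The event loop handles events one at a time, in arrival order.
    for posted_at, loop_cost, job in sorted(events, key=lambda e: e[0]):
        start = max(posted_at, loop_free_at)
        loop_free_at = start + loop_cost
        latencies.append(loop_free_at + job.stages - job.submit_at)
    return latencies


# Four identical jobs submitted at the same time (illustrative numbers).
jobs = [Job(submit_at=0.0, partitions=3.0, handle=1.0, stages=5.0)
        for _ in range(4)]
lazy = job_latencies(jobs, eager=False)
early = job_latencies(jobs, eager=True)
print(sum(lazy) / len(lazy), sum(early) / len(early))
```

With a single job the two modes give the same latency in this model, which
matches case (1); with several jobs only the lazy mode makes later jobs queue
behind earlier partition listings, which matches case (2).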