YanTang Zhai created SPARK-4961:
-----------------------------------
Summary: Put HadoopRDD.getPartitions forward to reduce
DAGScheduler.JobSubmitted processing time
Key: SPARK-4961
URL: https://issues.apache.org/jira/browse/SPARK-4961
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: YanTang Zhai
Priority: Minor
HadoopRDD.getPartitions is evaluated lazily, while the DAGScheduler.JobSubmitted
event is being processed. If the input directory is large, getPartitions can
take a long time; in our cluster it takes anywhere from 0.029s to 766.699s.
While one JobSubmitted event is being processed, all others have to wait. We
therefore want to move the HadoopRDD.getPartitions call earlier, so that it no
longer counts toward DAGScheduler.JobSubmitted processing time and other
JobSubmitted events do not have to wait as long. A HadoopRDD object could
compute its partitions when it is instantiated.
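To illustrate the difference between computing partitions lazily and computing them at instantiation, here is a minimal Python sketch. ToyInputRDD, list_splits and the eager flag are hypothetical names made up for this example; they are not part of Spark's HadoopRDD API, and the sketch only shows where the listing cost is paid.

```python
class ToyInputRDD:
    """Hypothetical stand-in for an input RDD; this is not Spark's HadoopRDD API."""

    def __init__(self, list_splits, eager=False):
        self._list_splits = list_splits
        self._partitions = None
        if eager:
            # Proposed behavior: pay the listing cost on the caller's thread,
            # at construction time, before any job is submitted.
            self._partitions = list_splits()

    def partitions(self):
        # Current behavior when eager=False: the listing runs on first access,
        # which in the issue's scenario happens inside JobSubmitted handling.
        if self._partitions is None:
            self._partitions = self._list_splits()
        return self._partitions


listing_calls = []


def list_splits():
    # Assumed placeholder for an expensive input-directory listing.
    listing_calls.append("listed")
    return ["split-0", "split-1"]


lazy_rdd = ToyInputRDD(list_splits)
calls_after_lazy_init = len(listing_calls)   # nothing listed yet

eager_rdd = ToyInputRDD(list_splits, eager=True)
calls_after_eager_init = len(listing_calls)  # listed once, at construction
```

In both variants the listing runs at most once per object; only the thread and the moment that pay for it change.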
We can compare the execution time before and after the optimization by
splitting it into the following segments:
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or
TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_____]
(1) The app has only one job
(a) Before optimization, the execution time of the job is
[time1__][time2_][time3___][time4_____].
(b) After optimization, the execution time of the job is
[time1__][time3___][time2_][time4_____].
In summary, if the app has only one job, the total execution time is the same
before and after optimization.
(2) The app has 4 jobs
(a) Before optimization,
job1 execution time is [time2_][time3___][time4_____],
job2 execution time is [time2__________][time3___][time4_____],
job3 execution time is................................[time2____][time3___][time4_____],
job4 execution time is................................[time2_____________][time3___][time4_____].
(b) After optimization,
job1 execution time is [time3___][time2_][time4_____],
job2 execution time is [time3___][time2__][time4_____],
job3 execution time is................................[time3___][time2_][time4_____],
job4 execution time is................................[time3___][time2__][time4_____].
In summary, if the app has multiple jobs, the average job execution time is
lower after optimization than before.
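The comparison above can be checked with a small Python simulation. This is a sketch under stated assumptions, not a model of Spark's scheduler: JobSubmitted events are handled one at a time, jobs are submitted from separate threads so their getPartitions calls can overlap after the change, stages do not compete for resources, and all durations are made-up numbers. The function finish_times and the job tuples are hypothetical.

```python
def finish_times(jobs, eager):
    """Return the finish time of each job.

    jobs: list of (submit_time, t2, t3, t4), where t2 is the JobSubmitted
    handling time, t3 the getPartitions time and t4 the stage time.
    eager=False: t3 is spent inside the serial event loop (before the change).
    eager=True: t3 is spent on the submitting thread (after the change).
    """
    events = []
    for idx, (submit, t2, t3, t4) in enumerate(jobs):
        arrival = submit + t3 if eager else submit   # when the event is posted
        service = t2 if eager else t2 + t3           # time held in the loop
        events.append((arrival, idx, service, t4))
    events.sort()

    loop_free = 0  # time at which the single event loop becomes idle
    finished = [0] * len(jobs)
    for arrival, idx, service, t4 in events:
        start = max(arrival, loop_free)  # wait for earlier events to finish
        loop_free = start + service
        finished[idx] = loop_free + t4
    return finished


def latencies(jobs, eager):
    # Execution time per job, measured from its own submission.
    return [end - job[0] for end, job in zip(finish_times(jobs, eager), jobs)]


# Assumed example: jobs 1 and 2 are submitted at t=0, jobs 3 and 4 at t=20.
four_jobs = [(0, 1, 10, 5), (0, 1, 10, 5), (20, 1, 10, 5), (20, 1, 10, 5)]
one_job = [(0, 1, 10, 5)]

before = latencies(four_jobs, eager=False)  # [16, 27, 18, 29]
after = latencies(four_jobs, eager=True)    # [16, 17, 16, 17]
```

With these numbers the average drops from 22.5 to 16.5 time units, while the single-job case takes 16 units either way, which matches both summaries.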