[ https://issues.apache.org/jira/browse/SPARK-3545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Or closed SPARK-3545.
----------------------------
    Resolution: Won't Fix

> Move HadoopRDD.getPartitions forward and defer TaskScheduler.start in 
> SparkContext to reduce DAGScheduler.JobSubmitted processing time and shorten 
> the cluster resource occupation period
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-3545
>                 URL: https://issues.apache.org/jira/browse/SPARK-3545
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> We have two problems:
> (1) HadoopRDD.getPartitions is evaluated lazily, inside the DAGScheduler's 
> handling of the JobSubmitted event. If the input directory is large, 
> getPartitions can take a long time; in our cluster it ranges from 0.029s to 
> 766.699s. While one JobSubmitted event is being processed, all others must 
> wait. We therefore want to move HadoopRDD.getPartitions forward, out of the 
> event loop, so that DAGScheduler.JobSubmitted processing time drops and other 
> JobSubmitted events no longer wait as long. A HadoopRDD object could compute 
> its partitions when it is instantiated.
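> As a user-side illustration only (not the proposed internal change), the cost 
> can be paid eagerly, off the DAGScheduler event loop, because RDD.partitions 
> caches its result after the first call. The input path below is a placeholder:
>
>     val rdd = sc.textFile("hdfs://namenode/large/input/dir")
>     // Compute the input splits now, on the calling thread; the
>     // partition array is memoized, so DAGScheduler.JobSubmitted
>     // will not pay this cost again when the job is submitted.
>     rdd.partitions
>     rdd.count()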
> (2) When a SparkContext object is instantiated, the TaskScheduler is started 
> and resources are allocated from the cluster immediately. However, those 
> resources may go unused for a while, for example while 
> DAGScheduler.JobSubmitted is still being processed, and are wasted during 
> that period. We therefore want to defer TaskScheduler.start to shorten the 
> cluster resource occupation period, especially on a busy cluster. The 
> TaskScheduler could be started just before the stages run.
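> A rough, self-contained sketch of the deferral idea; the names below are 
> illustrative stand-ins, not Spark's actual internals:
>
>     trait SchedulerLike { def start(): Unit }
>
>     // Wraps a scheduler and starts it only on first use, i.e. just
>     // before the first stage runs, instead of at SparkContext
>     // construction time, so no executors are held while idle.
>     class LazyStartingScheduler(underlying: SchedulerLike) extends SchedulerLike {
>       private var started = false
>       def start(): Unit = synchronized {
>         if (!started) { underlying.start(); started = true }
>       }
>     }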
> We can analyse and compare execution timelines before and after the 
> optimization. Bar widths denote durations:
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted execution time (excluding HadoopRDD.getPartitions 
> and TaskScheduler.start): [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stage execution time: [time4_____]
> (1) The app has only one job
> (a) Job execution time:
>     before optimization: [time1__][time2_][time3___][time4_____]
>     after optimization:  [time3___][time2_][time1__][time4_____]
> (b) Cluster resource occupation period:
>     before optimization: [time2_][time3___][time4_____]
>     after optimization:  [time4_____]
> In summary, if the app has only one job, the total execution time is the same 
> before and after the optimization, while the cluster resource occupation 
> period after the optimization is shorter.
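> For intuition, a toy calculation of the one-job case with hypothetical 
> durations (numbers are illustrative, not measurements):
>
>     val time1 = 1.0   // TaskScheduler.start
>     val time2 = 2.0   // JobSubmitted handling, excl. getPartitions/start
>     val time3 = 30.0  // HadoopRDD.getPartitions
>     val time4 = 60.0  // stage execution
>     val jobTimeBefore  = time1 + time2 + time3 + time4  // 93.0 s
>     val jobTimeAfter   = time3 + time2 + time1 + time4  // 93.0 s, unchanged
>     val occupiedBefore = time2 + time3 + time4          // 92.0 s holding executors
>     val occupiedAfter  = time4                          // 60.0 s holding executors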
> (2) The app has 4 jobs
> Before optimization:
>     job1 execution time: [time2_][time3___][time4_____]
>     job2 execution time: [time2__________][time3___][time4_____]
>     job3 execution time: ................................[time2____][time3___][time4_____]
>     job4 execution time: ................................[time2______________][time3___][time4_____]
> After optimization:
>     job1 execution time: [time3___][time2_][time1__][time4_____]
>     job2 execution time: [time3___][time2__________][time4_____]
>     job3 execution time: ................................[time3___][time2_][time4_____]
>     job4 execution time: ................................[time3___][time2__][time4_____]
> (Leading dots mark a later submission time; a wider [time2...] bar includes 
> time spent waiting in the DAGScheduler event queue behind earlier 
> JobSubmitted events.)
> In summary, if the app has multiple jobs, the average execution time after 
> the optimization is lower, and the cluster resource occupation period is 
> shorter.
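> A toy serial-queue model of the event loop makes the averaging argument 
> concrete (all numbers hypothetical; time1 is ignored for simplicity):
>
>     val time2 = 2.0; val time3 = 30.0; val time4 = 60.0
>     val jobs = 4
>     // Before: each JobSubmitted holds the single event loop for
>     // time2 + time3, so job i also waits for the i-1 jobs ahead of it.
>     val avgBefore = (1 to jobs).map(i => i * (time2 + time3) + time4).sum / jobs
>     // After: getPartitions runs up front (outside the loop, per job),
>     // so each event holds the loop only for time2.
>     val avgAfter = (1 to jobs).map(i => time3 + i * time2 + time4).sum / jobs
>     // avgBefore = 140.0 s, avgAfter = 95.0 s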


