[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

JoshRosen Mon, 29 Dec 2014 11:36:20 -0800

Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3794#issuecomment-68294158
  
    To reformat the PR description to make it a little easier to read:
    
    > HadoopRDD.getPartitions is evaluated lazily, while DAGScheduler is 
processing the JobSubmitted event. If the input directory is large, 
getPartitions can take a long time; for example, in our cluster it takes 
anywhere from 0.029s to 766.699s. While one JobSubmitted event is being 
processed, the others have to wait. We therefore want to move 
HadoopRDD.getPartitions earlier to reduce the DAGScheduler.JobSubmitted 
processing time, so that other JobSubmitted events do not have to wait as 
long. A HadoopRDD object could compute its partitions when it is instantiated.
    > 
    > We could analyse and compare the execution time before and after 
optimization.
    > ```
    > TaskScheduler.start execution time: [time1__]
    > DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
    > HadoopRDD.getPartitions execution time: [time3___]
    > Stages execution time: [time4_____].
    > ```
    > (1) The app has only one job
    > ```
    > The execution time of the job before optimization is [time1__][time2_][time3___][time4_____].
    > The execution time of the job after optimization is  [time1__][time3___][time2_][time4_____].
    > ```
    > In summary, if the app has only one job, the total execution time is the 
same before and after optimization.
    > (2) The app has 4 jobs
    > (a) Before optimization,
    > ```
    > job1 execution time is [time2_][time3___][time4_____],
    > job2 execution time is [time2__________][time3___][time4_____],
    > job3 execution time is................................[time2____][time3___][time4_____],
    > job4 execution time is................................[time2_____________][time3___][time4_____].
    > ```
    > (b) After optimization,
    > ```
    > job1 execution time is [time3___][time2_][time4_____],
    > job2 execution time is [time3___][time2__][time4_____],
    > job3 execution time is................................[time3___][time2_][time4_____],
    > job4 execution time is................................[time3___][time2__][time4_____].
    > ```
    > In summary, if the app has multiple jobs, the average execution time 
after optimization is lower than before.
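
    The timelines above can be checked with a small model. This is only a sketch under stated assumptions, not Spark code: `JobSubmittedModel`, `Job`, `lazyPartitions`, and `eagerPartitions` are hypothetical names, all durations are made-up time units, the event loop is assumed to handle one JobSubmitted event at a time, jobs are assumed to be submitted from separate threads so their partition listings can overlap, and stages are assumed not to compete for cluster resources.

    ```scala
    object JobSubmittedModel {
      // Hypothetical job description; all times are in arbitrary units.
      final case class Job(submitAt: Long, partitionTime: Long, handlerTime: Long, stageTime: Long)

      /** Completion times when getPartitions runs inside the single-threaded event loop. */
      def lazyPartitions(jobs: List[Job]): List[Long] = {
        var loopFreeAt = 0L // time at which the event loop can take the next event
        jobs.sortBy(_.submitAt).map { j =>
          val start = math.max(loopFreeAt, j.submitAt)
          // The loop is blocked for both the partition listing and the handler itself.
          loopFreeAt = start + j.partitionTime + j.handlerTime
          loopFreeAt + j.stageTime
        }
      }

      /** Completion times when partitions are computed on the submitting thread. */
      def eagerPartitions(jobs: List[Job]): List[Long] = {
        var loopFreeAt = 0L
        // Each job reaches the event loop only after its own partition listing is done.
        jobs.map(j => (j.submitAt + j.partitionTime, j)).sortBy(_._1).map { case (readyAt, j) =>
          val start = math.max(loopFreeAt, readyAt)
          // The loop is blocked only for the handler.
          loopFreeAt = start + j.handlerTime
          loopFreeAt + j.stageTime
        }
      }
    }
    ```

    With four identical jobs `Job(0, 3, 1, 5)`, the lazy variant finishes them at 9, 13, 17, and 21, while the eager variant finishes them at 9, 10, 11, and 12. With a single job both variants finish at 9, which matches case (1).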
