[GitHub] spark pull request: [SPARK-4961] [CORE] Put HadoopRDD.getPartition...

YanTangZhai Wed, 24 Dec 2014 19:20:06 -0800

GitHub user YanTangZhai opened a pull request:

    https://github.com/apache/spark/pull/3794


    [SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

    HadoopRDD.getPartitions is evaluated lazily, so it runs while
    DAGScheduler.JobSubmitted is being processed. If the input directory is
    large, getPartitions can take a long time. For example, in our cluster it
    takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is
    being processed, the others have to wait. We therefore want to move
    HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted
    processing time, so that other JobSubmitted events do not have to wait as
    long. A HadoopRDD object could compute its partitions when it is
    instantiated.
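
To make the lazy versus eager distinction concrete, here is a minimal sketch in Python rather than Spark's own Scala. Every name in it (`list_splits`, `LazyInput`, `EagerInput`, `calls`) is hypothetical and is not part of Spark or of this patch; `list_splits` only stands in for the expensive listing of a large input directory.

```python
calls = []  # records every (simulated) expensive split listing, in order


def list_splits(input_dir):
    """Hypothetical stand-in for the costly partition computation."""
    calls.append(input_dir)
    return [f"{input_dir}/part-{i}" for i in range(3)]


class LazyInput:
    """Partitions are computed on first access, i.e. wherever the first caller is."""

    def __init__(self, input_dir):
        self.input_dir = input_dir
        self._partitions = None

    def partitions(self):
        # The cost is paid here, by whichever thread asks first.
        if self._partitions is None:
            self._partitions = list_splits(self.input_dir)
        return self._partitions


class EagerInput(LazyInput):
    """Partitions are computed at construction time, before any job is submitted."""

    def __init__(self, input_dir):
        super().__init__(input_dir)
        self.partitions()  # pay the cost up front and cache the result
```

With the lazy variant, the first caller of `partitions()` pays the cost; in the situation described above that caller is the scheduler's event processing. With the eager variant, the cost is paid by whoever constructs the object, and later calls only read the cached result.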
    We can analyse and compare the execution time before and after the
    optimization, using the following notation:
    TaskScheduler.start execution time: [time1__]
    DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
    HadoopRDD.getPartitions execution time: [time3___]
    Stages execution time: [time4_____]
    (1) The app has only one job
    (a) The execution time of the job before optimization is
        [time1__][time2_][time3___][time4_____].
    (b) The execution time of the job after optimization is
        [time1__][time3___][time2_][time4_____].
    In summary, if the app has only one job, the total execution time is the
    same before and after optimization.
    (2) The app has 4 jobs
    (a) Before optimization,
    job1 execution time is [time2_][time3___][time4_____],
    job2 execution time is [time2__________][time3___][time4_____],
    job3 execution time is................................[time2____][time3___][time4_____],
    job4 execution time is................................[time2_____________][time3___][time4_____].
    (b) After optimization,
    job1 execution time is [time3___][time2_][time4_____],
    job2 execution time is [time3___][time2__][time4_____],
    job3 execution time is................................[time3___][time2_][time4_____],
    job4 execution time is................................[time3___][time2__][time4_____].
    In summary, if the app has multiple jobs, the average execution time after
    optimization is lower than before.
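
The timelines above can be checked with a toy model. The sketch below is not Spark code and the numbers are made up; `simulate` and `JOBS` are hypothetical names. It assumes a single event loop that handles JobSubmitted events one at a time, that partition computation moved out of the loop can overlap across jobs, and that stages of different jobs can run concurrently.

```python
def simulate(jobs, eager):
    """Return per-job latency (completion time minus submit time).

    jobs: list of (submit_time, t2, t3, t4) where t2 is the JobSubmitted
    bookkeeping, t3 is getPartitions and t4 is the stage execution time.
    eager=False: t3 runs inside the serialized event loop (before optimization).
    eager=True:  t3 runs before the event is posted (after optimization).
    """
    events = []
    for idx, (submit, t2, t3, t4) in enumerate(jobs):
        if eager:
            events.append((submit + t3, idx, t2))   # loop only does t2
        else:
            events.append((submit, idx, t2 + t3))   # loop does t2 and t3
    events.sort()

    loop_free = 0  # time at which the event loop can take the next event
    latencies = [0] * len(jobs)
    for arrival, idx, handling in events:
        start = max(arrival, loop_free)   # wait for earlier events to finish
        loop_free = start + handling
        submit, _, _, t4 = jobs[idx]
        latencies[idx] = loop_free + t4 - submit
    return latencies


# Made-up workload: two jobs submitted at t=0, two more at t=12,
# each with t2=1, t3=10, t4=20.
JOBS = [(0, 1, 10, 20), (0, 1, 10, 20), (12, 1, 10, 20), (12, 1, 10, 20)]
```

For this workload, `simulate(JOBS, eager=False)` gives `[31, 42, 41, 52]` and `simulate(JOBS, eager=True)` gives `[31, 32, 31, 32]`, so the mean latency drops from 41.5 to 31.5. With a single job, both variants give 31, which matches case (1).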

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/YanTangZhai/spark SPARK-4961

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3794.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3794
    
----
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai <[email protected]>
Date:   2014-08-06T13:07:08Z

    Merge pull request #1 from apache/master
    
    update

commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai <[email protected]>
Date:   2014-08-20T13:14:08Z

    Merge pull request #3 from apache/master
    
    Update

commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai <[email protected]>
Date:   2014-09-12T06:54:58Z

    Merge pull request #6 from apache/master
    
    Update

commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai <[email protected]>
Date:   2014-09-16T12:03:22Z

    Merge pull request #7 from apache/master
    
    Update

commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai <[email protected]>
Date:   2014-10-20T12:52:22Z

    Merge pull request #8 from apache/master
    
    update

commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai <[email protected]>
Date:   2014-11-04T09:00:31Z

    Merge pull request #9 from apache/master
    
    Update

commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai <[email protected]>
Date:   2014-11-11T03:18:24Z

    Merge pull request #10 from apache/master
    
    Update

commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai <[email protected]>
Date:   2014-12-01T11:23:56Z

    Merge pull request #11 from apache/master
    
    Update

commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai <[email protected]>
Date:   2014-12-05T11:08:31Z

    Merge pull request #12 from apache/master
    
    update

commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai <[email protected]>
Date:   2014-12-24T03:15:22Z

    Merge pull request #15 from apache/master
    
    update

commit 5601a8b1458c9a7317a2e4e0463358f0a054c181
Author: yantangzhai <[email protected]>
Date:   2014-12-25T03:17:57Z

    [SPARK-4961] [CORE] Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

----


