GitHub user YanTangZhai opened a pull request:
https://github.com/apache/spark/pull/3810
[SPARK-4962] [CORE] Put TaskScheduler.start back in SparkContext to shorten
cluster resources occupation period
When a SparkContext object is instantiated, the TaskScheduler is started and some
resources are allocated from the cluster. However, these resources may not
actually be used for a while — for example, while DAGScheduler.JobSubmitted is
still being processed — so they are wasted during that period. Thus, we want to
defer TaskScheduler.start in order to shorten the cluster resource occupation
period, especially on a busy cluster. The TaskScheduler could be started just
before the stages run.
We can analyse and compare the resource occupation period before and after
this optimization.
TaskScheduler.start execution time: [time1__]
DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions and TaskScheduler.start) execution time: [time2_]
HadoopRDD.getPartitions execution time: [time3___]
Stages execution time: [time4_____]
The cluster resource occupation period before the optimization is
[time2_][time3___][time4_____].
The cluster resource occupation period after the optimization
is....[time3___][time4_____].
In summary, the occupation period after the optimization is shorter than
before. If HadoopRDD.getPartitions could also be moved earlier (SPARK-4961),
the period might be shortened further, to just [time4_____].
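The comparison above can be sketched numerically. The durations below are hypothetical, purely for illustration — real values depend on the job and the cluster — but they show where the saving comes from: with a deferred start, the app no longer holds cluster resources while DAGScheduler.JobSubmitted runs.

```python
# Illustrative timing model with hypothetical durations in seconds.
job_submitted = 4.0    # DAGScheduler.JobSubmitted phase (time2)
get_partitions = 6.0   # HadoopRDD.getPartitions phase (time3)
stages = 30.0          # stage execution (time4)

# Before: TaskScheduler.start runs at SparkContext construction, so the app
# holds cluster resources across JobSubmitted, getPartitions, and the stages.
before = job_submitted + get_partitions + stages

# After: TaskScheduler.start is deferred until just before the stages run,
# so the JobSubmitted phase no longer occupies cluster resources.
after = get_partitions + stages

print(f"before={before}s after={after}s saved={before - after}s")
```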
Saving these resources matters on a busy cluster, and the main purpose of this
PR is to reduce that waste.
For example, a process initializes a SparkContext instance, reads a few
files from HDFS or many records from PostgreSQL, and then calls an RDD's
collect operation to submit a job.
When the SparkContext is initialized, an app is submitted to the cluster and
some resources are held by this app.
These resources are not really used until a job is submitted by an RDD
action, so the resources held in the period from initialization to actual use
can be considered wasted.
If the app is submitted when the SparkContext is initialized, all of the
resources the app needs may be granted before the job runs, and the job can
then run efficiently without resource constraints.
On the contrary, if the app is submitted only when the job is submitted, the
resources it needs may be granted at different times, and the job may run less
efficiently while some resource requests are still pending.
Thus I add a configuration parameter, spark.scheduler.app.slowstart (default
false), to let the user trade off between economy and efficiency.
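As a usage sketch, the flag proposed here would be enabled like any other Spark configuration property, e.g. via spark-submit's --conf option. Note that the property name is the one proposed by this PR, and the application class and jar below are placeholders:

```shell
# Enable deferred TaskScheduler.start (proposed in this PR; default false).
spark-submit \
  --master yarn-client \
  --conf spark.scheduler.app.slowstart=true \
  --class com.example.MyApp \
  myapp.jar
```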
There are 9 kinds of master URL and 6 kinds of SchedulerBackend.
LocalBackend and SimrSchedulerBackend don't need a deferred start, since it
makes no difference for them.
SparkClusterSchedulerBackend (yarn-standalone or yarn-cluster) does not defer
the start either, since the app is already submitted in advance by SparkSubmit.
CoarseMesosSchedulerBackend and MesosSchedulerBackend could defer the start.
YarnClientSchedulerBackend (yarn-client) could defer the start.
As a first step, this PR defers TaskScheduler.start only for yarn-client mode.
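The per-backend decision above can be summarized in a small table. This is only an illustrative restatement of the PR's reasoning, not code from the patch:

```python
# For each SchedulerBackend named in the PR: could TaskScheduler.start be
# deferred? Only yarn-client actually gets the deferred start in this patch.
could_defer_start = {
    "LocalBackend": False,                  # local mode: no cluster resources held
    "SimrSchedulerBackend": False,          # deferral makes no difference
    "SparkClusterSchedulerBackend": False,  # yarn-cluster: app submitted by SparkSubmit
    "CoarseMesosSchedulerBackend": True,    # could defer (not done in this PR)
    "MesosSchedulerBackend": True,          # could defer (not done in this PR)
    "YarnClientSchedulerBackend": True,     # yarn-client: deferred by this PR
}

deferrable = sorted(b for b, ok in could_defer_start.items() if ok)
print(deferrable)
```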
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/YanTangZhai/spark SPARK-4962
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3810.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3810
----
commit cdef539abc5d2d42d4661373939bdd52ca8ee8e6
Author: YanTangZhai <[email protected]>
Date: 2014-08-06T13:07:08Z
Merge pull request #1 from apache/master
update
commit cbcba66ad77b96720e58f9d893e87ae5f13b2a95
Author: YanTangZhai <[email protected]>
Date: 2014-08-20T13:14:08Z
Merge pull request #3 from apache/master
Update
commit 8a0010691b669495b4c327cf83124cabb7da1405
Author: YanTangZhai <[email protected]>
Date: 2014-09-12T06:54:58Z
Merge pull request #6 from apache/master
Update
commit 03b62b043ab7fd39300677df61c3d93bb9beb9e3
Author: YanTangZhai <[email protected]>
Date: 2014-09-16T12:03:22Z
Merge pull request #7 from apache/master
Update
commit 76d40277d51f709247df1d3734093bf2c047737d
Author: YanTangZhai <[email protected]>
Date: 2014-10-20T12:52:22Z
Merge pull request #8 from apache/master
update
commit d26d98248a1a4d0eb15336726b6f44e05dd7a05a
Author: YanTangZhai <[email protected]>
Date: 2014-11-04T09:00:31Z
Merge pull request #9 from apache/master
Update
commit e249846d9b7967ae52ec3df0fb09e42ffd911a8a
Author: YanTangZhai <[email protected]>
Date: 2014-11-11T03:18:24Z
Merge pull request #10 from apache/master
Update
commit 6e643f81555d75ec8ef3eb57bf5ecb6520485588
Author: YanTangZhai <[email protected]>
Date: 2014-12-01T11:23:56Z
Merge pull request #11 from apache/master
Update
commit 718afebe364bd54ac33be425e24183eb1c76b5d3
Author: YanTangZhai <[email protected]>
Date: 2014-12-05T11:08:31Z
Merge pull request #12 from apache/master
update
commit e4c2c0a18bdc78cc17823cbc2adf3926944e6bc5
Author: YanTangZhai <[email protected]>
Date: 2014-12-24T03:15:22Z
Merge pull request #15 from apache/master
update
commit 05469de9f0482bce54a60161b9cb386a64173826
Author: yantangzhai <[email protected]>
Date: 2014-12-26T07:11:30Z
[SPARK-4962] [CORE] Put TaskScheduler.start back in SparkContext to shorten
cluster resources occupation period
----