GitHub user dhruve opened a pull request:

    https://github.com/apache/spark/pull/18950

    [SPARK-20589][Core][Scheduler] Allow limiting task concurrency per job group

    ## What changes were proposed in this pull request?
    This change allows the user to specify the maximum number of tasks running 
concurrently in a given job group. (See the JIRA comments for more context on 
why this is implemented at the job group level rather than the stage level.) 
This is beneficial when the user wants to avoid overwhelming (effectively 
DoS-ing) an external service accessed from many executors, without having to 
repartition or coalesce existing RDDs.
    
    This code change introduces a new user-level configuration, 
`spark.job.[userJobGroup].maxConcurrentTasks`, which caps the number of tasks 
in that job group that can be active at any given point in time.
    
    The user can enable the feature by setting the appropriate job group and 
passing the conf:
    
    `conf.set("spark.job.group1.maxConcurrentTasks", "10")`
    `...`
    `sc.setJobGroup("group1", "", false)`
    `sc.parallelize(1 to 100000, 10).map(x => x + 1).count`
    `sc.clearJobGroup`
    
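    A more complete, self-contained version of the snippet above, as a runnable sketch. The config key `spark.job.group1.maxConcurrentTasks` and the API calls (`setJobGroup`, `clearJobGroup`) are the ones shown above; the local master, app name, and `main` wrapper are illustrative assumptions.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object MaxConcurrentTasksExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("max-concurrent-tasks-example") // illustrative app name
          .setMaster("local[8]")                      // illustrative master for a quick local run
          // Cap the job group "group1" at 10 concurrently running tasks (proposed config).
          .set("spark.job.group1.maxConcurrentTasks", "10")

        val sc = new SparkContext(conf)
        try {
          // Every job submitted while this group is set is subject to the cap.
          sc.setJobGroup("group1", "limited to 10 concurrent tasks", interruptOnCancel = false)
          val count = sc.parallelize(1 to 100000, 10).map(x => x + 1).count()
          println(s"count = $count")
          sc.clearJobGroup()
        } finally {
          sc.stop()
        }
      }
    }
    ```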
    
    #### Changes proposed in this fix 
    This change limits the number of tasks (and, in turn, the number of 
executors that need to be acquired) that can run simultaneously in a given job 
group and its subsequent jobs and stages, provided the appropriate job group 
and max concurrency configs are set.
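    
    As a hedged sketch of the per-group nature of the limit: each job group can carry its own cap under the proposed `spark.job.[userJobGroup].maxConcurrentTasks` key. The group names ("etl", "scoring"), the caps, and the workloads below are illustrative assumptions, not part of the PR.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}

    object PerGroupLimitsExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("per-group-limits-example")            // illustrative
          .setMaster("local[16]")                            // illustrative
          .set("spark.job.etl.maxConcurrentTasks", "4")      // "etl" group capped at 4 tasks
          .set("spark.job.scoring.maxConcurrentTasks", "12") // "scoring" group capped at 12 tasks

        val sc = new SparkContext(conf)
        try {
          // Jobs and stages submitted under "etl" would run at most 4 tasks at a time,
          // e.g. to avoid overwhelming an external service.
          sc.setJobGroup("etl", "rate-limited external calls", interruptOnCancel = false)
          sc.parallelize(1 to 1000, 100).foreach(_ => ()) // placeholder for the external calls
          sc.clearJobGroup()

          // Jobs submitted under "scoring" get a higher, independent cap.
          sc.setJobGroup("scoring", "compute-heavy work", interruptOnCancel = false)
          val total = sc.parallelize(1 to 100000, 50).map(_ * 2).sum()
          println(s"total = $total")
          sc.clearJobGroup()
        } finally {
          sc.stop()
        }
      }
    }
    ```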
    
    ## How was this patch tested?
    Ran unit tests and multiple manual tests with various combinations of:
    - single/multiple/no job groups
    - executors with single/multi cores
    - dynamic allocation on/off


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dhruve/spark impr/SPARK-20589

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18950.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18950
    
----
commit 824396c82977171c38ab5d7f6c0f84bc19eccaba
Author: Dhruve Ashar <[email protected]>
Date:   2017-08-15T14:18:21Z

    [SPARK-20589] Allow limiting task concurrency per stage

commit d3f8162dab4ca7065d7f296fd03528ce6ddfb923
Author: Dhruve Ashar <[email protected]>
Date:   2017-08-15T14:45:18Z

    Merge branch 'master' of github.com:apache/spark into impr/SPARK-20589

----

