GitHub user dhruve reopened a pull request:
https://github.com/apache/spark/pull/18950
[SPARK-20589][Core][Scheduler] Allow limiting task concurrency per job group
## What changes were proposed in this pull request?
This change allows the user to specify the maximum number of tasks that can
run concurrently in a given job group. (See the comments on the JIRA issue for
context on why this is implemented at the job group level rather than the stage
level.) It is useful when the user wants to avoid overwhelming an external
service (effectively a DoS) that is accessed from multiple executors, without
having to repartition or coalesce existing RDDs.
This code change introduces a new user-level configuration,
`spark.job.[userJobGroup].maxConcurrentTasks`, which sets the maximum number of
tasks that can be active at any given point in time.
The user can use the feature by setting the appropriate jobGroup and
passing the conf:
```
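// Cap the number of concurrently running tasks for job group "group1" at 10.
conf.set("spark.job.group1.maxConcurrentTasks", "10")
...
// Jobs submitted from this thread now belong to "group1".
sc.setJobGroup("group1", "", false)
sc.parallelize(1 to 100000, 10).map(x => x + 1).count()
sc.clearJobGroup()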
```
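As a rough illustration of how the per-group key is formed and read, here is a minimal sketch in plain Scala. The `JobGroupLimit` object and both of its methods are hypothetical names used only for this example; they are not part of Spark or of this patch. Treating a missing, non-numeric, or non-positive value as "no limit" is also an assumption.

```scala
// Hypothetical helper for illustration only; not Spark API and not the patch's code.
object JobGroupLimit {
  // Builds the key following the pattern spark.job.[userJobGroup].maxConcurrentTasks.
  def confKey(jobGroup: String): String =
    s"spark.job.$jobGroup.maxConcurrentTasks"

  // Returns Some(limit) when the key holds a positive integer.
  // Assumption: anything else (missing, malformed, <= 0) means "no limit".
  def maxConcurrentTasks(conf: Map[String, String], jobGroup: String): Option[Int] =
    conf.get(confKey(jobGroup))
      .flatMap(v => scala.util.Try(v.trim.toInt).toOption)
      .filter(_ > 0)
}
```

With `Map("spark.job.group1.maxConcurrentTasks" -> "10")`, looking up `"group1"` yields a limit of 10, while any other group name yields no limit.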
#### changes proposed in this fix
This change limits the number of tasks (and, in turn, the number of executors
to be acquired) that can run simultaneously in a given job group and its
subsequent jobs and stages, provided the appropriate job group and max
concurrency configs are set.
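The arithmetic behind such a limit can be sketched as follows. This is a simplified model and not the scheduler code from this patch: `ConcurrencyMath` and its methods are hypothetical names, and the executor bound assumes each executor runs `executorCores / cpusPerTask` tasks at once (the quantities configured by `spark.executor.cores` and `spark.task.cpus`).

```scala
// Hypothetical model for illustration only; not the patch's scheduler logic.
object ConcurrencyMath {
  // How many more tasks may start now without exceeding the group's limit.
  def launchable(pending: Int, running: Int, limit: Int): Int =
    math.max(0, math.min(pending, limit - running))

  // Upper bound on executors needed to run `limit` tasks at the same time,
  // assuming each executor runs executorCores / cpusPerTask tasks concurrently.
  def maxExecutorsNeeded(limit: Int, executorCores: Int, cpusPerTask: Int): Int = {
    val tasksPerExecutor = math.max(1, executorCores / cpusPerTask)
    (limit + tasksPerExecutor - 1) / tasksPerExecutor // ceiling division
  }
}
```

For example, with a limit of 10 and 7 tasks already running, at most 3 more tasks can start regardless of how many are pending; with 4-core executors and 1 CPU per task, no more than 3 executors are needed to reach the limit.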
## How was this patch tested?
Ran unit tests and multiple manual tests with various combinations of:
- single/multiple/no job groups
- executors with single/multi cores
- dynamic allocation on/off
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhruve/spark impr/SPARK-20589
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18950.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18950
----
commit 824396c82977171c38ab5d7f6c0f84bc19eccaba
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-15T14:18:21Z
[SPARK-20589] Allow limiting task concurrency per stage
commit d3f8162dab4ca7065d7f296fd03528ce6ddfb923
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-15T14:45:18Z
Merge branch 'master' of github.com:apache/spark into impr/SPARK-20589
commit 824621286ffb107010409c4d0d3442550628247d
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-21T16:51:41Z
Allow limiting task concurrency per stage in concurrent job groups
commit 517acb490ae5938a22c4175347f6bbc24b47781f
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-21T19:30:17Z
Remove comment
----