GitHub user dhruve opened a pull request:
https://github.com/apache/spark/pull/19157
[SPARK-20589][Core][Scheduler] Allow limiting task concurrency per job group
## What changes were proposed in this pull request?
This change allows the user to cap the number of tasks running concurrently in
a given job group. (See the comments on the JIRA for more context on why this
is implemented at the job group level rather than the stage level.) It is
useful when the user wants to avoid overloading an external service, in effect
a DoS, by accessing it from many executors at once, without having to
repartition or coalesce existing RDDs.
This change introduces a new user-level configuration,
`spark.job.[userJobGroup].maxConcurrentTasks`, which sets the maximum number of
tasks in that job group that may execute at any point in time.
The user can use the feature by setting the appropriate jobGroup and
passing the conf:
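```
// Cap job group "group1" at 10 concurrently running tasks.
conf.set("spark.job.group1.maxConcurrentTasks", "10")
...
// Arguments: groupId, description, interruptOnCancel.
sc.setJobGroup("group1", "", false)
sc.parallelize(1 to 100000, 10).map(x => x + 1).count()
// Later jobs on this thread no longer belong to "group1".
sc.clearJobGroup()
```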
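As a rough illustration of how such a per-group limit might be resolved, the sketch below looks up the key for a job group and computes how many more tasks may start. A plain `Map` stands in for the Spark configuration, and the helper names `maxConcurrentTasksFor` and `launchableTasks` are hypothetical; they are not taken from the patch, whose scheduler logic may differ.

```scala
object JobGroupLimitSketch {
  // Hypothetical helper: builds the per-group key and parses its value.
  // A plain Map stands in for SparkConf so the sketch has no dependencies.
  def maxConcurrentTasksFor(conf: Map[String, String], jobGroup: String): Option[Int] =
    conf.get(s"spark.job.$jobGroup.maxConcurrentTasks").map(_.toInt)

  // Hypothetical helper: how many additional tasks may start right now.
  // With no limit configured, every pending task may start.
  def launchableTasks(limit: Option[Int], running: Int, pending: Int): Int =
    limit match {
      case Some(max) => math.max(0, math.min(pending, max - running))
      case None      => pending
    }

  def main(args: Array[String]): Unit = {
    val conf = Map("spark.job.group1.maxConcurrentTasks" -> "10")
    // group1 is capped at 10: with 7 running, only 3 more may start.
    println(launchableTasks(maxConcurrentTasksFor(conf, "group1"), running = 7, pending = 20))
    // group2 has no limit configured, so all 20 pending tasks may start.
    println(launchableTasks(maxConcurrentTasksFor(conf, "group2"), running = 7, pending = 20))
  }
}
```

Returning an `Option` keeps the "no limit configured" case explicit, which matches the description above: the cap applies only when both the job group and its config are set.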
#### Changes proposed in this fix
This change limits the number of tasks (and, in turn, the number of executors
to be acquired) that can run simultaneously in a given job group and its
subsequent jobs and stages, provided the job group and its max concurrency
config are set.
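To see why capping tasks also caps the executors to be acquired, here is a small worked calculation. It assumes each executor runs `executorCores / cpusPerTask` tasks at once and that the executor target is the ceiling of allowed tasks over that figure; the function names are hypothetical and the patch's actual accounting may differ.

```scala
object ExecutorDemandSketch {
  // Tasks one executor can run at once (assumed: cores divided by CPUs per task).
  def tasksPerExecutor(executorCores: Int, cpusPerTask: Int): Int =
    executorCores / cpusPerTask

  // Executors needed for the tasks that are allowed to run concurrently.
  def executorsNeeded(outstandingTasks: Int, maxConcurrentTasks: Int, perExecutor: Int): Int = {
    val allowed = math.min(outstandingTasks, maxConcurrentTasks)
    (allowed + perExecutor - 1) / perExecutor // ceiling division
  }

  def main(args: Array[String]): Unit = {
    val perExecutor = tasksPerExecutor(executorCores = 4, cpusPerTask = 1)
    // 1000 outstanding tasks capped at 10 concurrent tasks: ceil(10 / 4) = 3 executors.
    println(executorsNeeded(1000, 10, perExecutor))
    // The same workload with no effective cap: ceil(1000 / 4) = 250 executors.
    println(executorsNeeded(1000, Int.MaxValue, perExecutor))
  }
}
```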
## How was this patch tested?
Ran unit tests and multiple manual tests with various combinations of:
- single/multiple/no job groups
- executors with single/multi cores
- dynamic allocation on/off
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dhruve/spark impr/SPARK-20589
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19157.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19157
----
commit 824396c82977171c38ab5d7f6c0f84bc19eccaba
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-15T14:18:21Z
[SPARK-20589] Allow limiting task concurrency per stage
commit d3f8162dab4ca7065d7f296fd03528ce6ddfb923
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-15T14:45:18Z
Merge branch 'master' of github.com:apache/spark into impr/SPARK-20589
commit 824621286ffb107010409c4d0d3442550628247d
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-21T16:51:41Z
Allow limiting task concurrency per stage in concurrent job groups
commit 517acb490ae5938a22c4175347f6bbc24b47781f
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-21T19:30:17Z
Remove comment
commit 65941f7884551e84a13a6cc2e7488a01e7d8beec
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-21T19:42:05Z
Fix comment style
commit 7aba73a31808f6b1017b85dfd4dd19e28365bd97
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-22T14:54:10Z
Merge branch 'master' of github.com:apache/spark into impr/SPARK-20589
commit 0e518f00ce97fd5d17fe89792c2503d2514b0473
Author: Dhruve Ashar <[email protected]>
Date: 2017-08-22T15:38:01Z
Fix new unit test and add comments
commit 8b3830004d69bd5f109fd9846f59583c23a910c7
Author: Dhruve Ashar <[email protected]>
Date: 2017-09-05T20:14:02Z
Resolve merge conflict and add test for speculative task
----