Github user dhruve commented on the issue:
https://github.com/apache/spark/pull/19194
@squito Thanks for pointing that out. What you mentioned makes sense, and I
dug deeper into the `DAGScheduler` and `activeJobForStage` to gather more
context. We could take the properties of the active job into account when the
stage is submitted; however, this behavior is nondeterministic.
Say we have two jobs from two different job groups with different task
concurrency thresholds. The one submitted first wins, because the stage won't
be recomputed for the second job, and there is no control over which job gets
submitted first (unless the user explicitly serializes them). The problem is
aggravated when the difference between the two jobs' task concurrency
thresholds is large: in that case, having the wrong value applied can
completely take down your remote service.
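To make the race concrete, here is a minimal Python model of the
first-active-job-wins behavior described above. This is not Spark code: the
`Job` shape and the `max_concurrent_tasks` field are hypothetical stand-ins
for the per-job-group threshold, and `active_job_for_stage` only sketches the
idea that the earliest still-active job needing a shared stage determines
whose properties apply.

```python
# Simplified model (not actual Spark code) of the "first active job wins"
# behavior for a shared stage: the earliest-submitted active job that needs
# the stage is chosen, so the second job's threshold is never consulted.

class Job:
    def __init__(self, job_id, stages, max_concurrent_tasks):
        self.job_id = job_id
        self.stages = stages  # stage ids this job depends on
        # hypothetical per-job-group concurrency threshold
        self.max_concurrent_tasks = max_concurrent_tasks

def active_job_for_stage(active_jobs, stage_id):
    """Return the first-submitted active job that uses this stage."""
    for job in active_jobs:  # active_jobs is ordered by submission time
        if stage_id in job.stages:
            return job
    return None

# Two jobs from different job groups share stage 0 but disagree on the
# threshold by two orders of magnitude.
job_a = Job(job_id=1, stages={0, 1}, max_concurrent_tasks=5)    # submitted first
job_b = Job(job_id=2, stages={0, 2}, max_concurrent_tasks=500)  # submitted second

winner = active_job_for_stage([job_a, job_b], stage_id=0)
# Stage 0 runs under job_a's threshold (5); job_b's value (500) is ignored.
# If the submission order flips, the stage runs with 500 instead -- and with
# a large gap between thresholds, the wrong winner can overwhelm the service.
```

If the threshold lived on the stage itself rather than on whichever job
happens to be active, this dependence on submission order would disappear.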
For deterministic behavior, I believe the best way to tackle this would be
to handle it in the stage properties, as was the original ask. However, since
that involves an API change, I didn't go that route, as its scope could be
much broader. If there are more fundamental use cases that require adding
something like this at the stage level, we should continue in that direction,
provided the community is open to and welcomes an API change.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]