[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16129380#comment-16129380 ]

Imran Rashid commented on SPARK-20589:
--------------------------------------

It's pretty much the same thing whether you're trying to limit concurrency at 
the beginning of the pipeline, at the end, or anywhere in between; that was 
just an example.  My suggested workaround is *not* to change the number of 
partitions -- I know that Spark is very sensitive to the number of partitions 
for all sorts of reasons.  I'm suggesting you run multiple applications, each 
with a different number of *executors*.  So you can still have a large number 
of tasks, but with a small number of executors you'll constrain concurrency.
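For example (a rough sketch, assuming YARN -- the jar names and the numbers 
here are made up), the rate-limited phase runs as its own application capped 
at 2 executors x 4 cores = 8 simultaneous tasks, while the heavy phase runs 
wide:

{code}
# Phase 1: the big transformation, as its own application, running wide.
spark-submit --num-executors 50 --executor-cores 4 prepare-pipeline.jar

# Phase 2: the rate-limited save, as a separate application with few
# executors; at most 2 * 4 = 8 tasks hit the external service at once.
spark-submit --num-executors 2 --executor-cores 4 save-to-service.jar
{code}

(If you use dynamic allocation, cap with spark.dynamicAllocation.maxExecutors 
instead of --num-executors.)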

Also, to be clear, the current proposed fix requires *exactly* the thing you 
are saying you don't want to do: "breaking the pipeline into different stages 
and running each with different configs".  You need to turn something like

{code}
bigRDD.map(...).filter(...).reduceByKey(...).flatMap(...).join(...).map(...).saveToSomeRateLimitedDestination()
{code}

into

{code}
sc.setJobGroup(...)
val dataReadyToSave =
  bigRDD.map(...).filter(...).reduceByKey(...).flatMap(...).join(...).map(...)
dataReadyToSave.persist(StorageLevel.DISK_ONLY)
// count() is just a cheap action that forces the pipeline to run and fill
// the persisted data, so the save below reads from disk instead of
// recomputing everything.
dataReadyToSave.count()

sc.setJobGroup(...)
dataReadyToSave.saveToSomeRateLimitedDestination()
sc.setJobGroup(...)
{code}
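(The job groups are the important part: setJobGroup applies to every job 
subsequently submitted from that thread, so the persist + count runs as one 
job and the rate-limited save as another, and under the proposed fix only the 
save job would get the concurrency cap.)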

You still need to break up the operations on your RDD, and persist the 
intermediate data somewhere.

In any case, I do understand that this is simpler than having two entirely 
independent Spark applications.  But I want to make sure this would actually 
help as much as you are expecting.

> Allow limiting task concurrency per stage
> -----------------------------------------
>
>                 Key: SPARK-20589
>                 URL: https://issues.apache.org/jira/browse/SPARK-20589
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your Spark job might be accessing another 
> service and you don't want to DOS that service.  For instance, Spark writing 
> to HBase or Spark doing HTTP PUTs on a service.  Many times you want to do 
> this without limiting the number of partitions. 


