[ https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011024#comment-16011024 ]

Amit Kumar commented on SPARK-20589:
------------------------------------

I can probably give more context. This originally arose from a use case where 
we work with a PairRDD of images, (url, imageData), where imageData is binary. 
The pipeline consists mostly of map tasks over the PairRDD, with the final 
step uploading the images to a storage service. 
The problem is that the RDD can be huge, so persisting it before the coalesce 
is expensive; on the other hand, without persisting, the reduced parallelism 
also affects the earlier stages.

> Allow limiting task concurrency per stage
> -----------------------------------------
>
>                 Key: SPARK-20589
>                 URL: https://issues.apache.org/jira/browse/SPARK-20589
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your Spark job accesses another service and 
> you don't want to DoS that service, for instance Spark writing to HBase or 
> Spark doing HTTP PUTs against a service.  Often you want to do this without 
> limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
