[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550803#comment-16550803
 ] 

Thomas Graves commented on SPARK-24615:
---------------------------------------

Yes, if any requirement can't be satisfied it would use dynamic allocation to 
release and reacquire containers.  I'm not saying we have to implement those 
parts right now; I'm saying we should keep them in mind during the design of 
this so they could be added later.  I linked one old Jira that was about 
dynamically changing things. It's been brought up many times since then in PRs 
and in conversations with customers; I'm not sure if there are other Jiras as 
well.  It's also somewhat related to SPARK-20589, where people just want to 
configure things per stage.

I actually question whether this should be done at the RDD level as well.  A 
set of partitions doesn't care what the resources are; it's generally the 
action you are taking on those RDD(s) that does. Note it could be more than 
one RDD.  For example, I could do ETL work on an RDD whose resource needs 
would be totally different than if I ran TensorFlow on that same RDD.  I do 
realize this is being tied in with the barrier stuff, which is on 
mapPartitions.
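
To make that concrete, here is a rough Scala sketch (parse, summarize, 
runTensorFlow, and the input path are placeholders, not real implementations) 
where the same RDD feeds two actions with very different resource needs:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("same-rdd-two-workloads").getOrCreate()
    val sc = spark.sparkContext

    // Placeholder functions so the sketch type-checks; not real implementations.
    def parse(line: String): Array[Double] = line.split(",").map(_.toDouble)
    def summarize(row: Array[Double]): Double = row.sum
    def runTensorFlow(rows: Iterator[Array[Double]]): Iterator[Array[Float]] = ???

    // One set of partitions, shared by two very different workloads.
    val records = sc.textFile("hdfs:///data/training").map(parse)

    // ETL-style branch: ordinary CPU tasks, data locality is what matters.
    val total = records.map(summarize).reduce(_ + _)

    // Training branch over the *same* RDD: barrier-mode mapPartitions that
    // would want, say, one GPU per task -- a requirement of this action,
    // not of `records`.
    val weights = records.barrier().mapPartitions(runTensorFlow).collect()

The GPU requirement here belongs to the second action, not to the partitions 
in `records`, which is the concern with attaching it to the RDD itself.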

I'm not trying to be difficult, and I realize this Jira is more specific to 
the external ML algos, but I don't want many APIs for the same thing.

Unfortunately I haven't thought through a good solution for this. A while back 
my initial thought was to be able to pass that resource context in to the API 
calls, but that obviously gets trickier, especially with pure SQL support.  I 
need to think about it some more.  The above proposal for .withResources is 
definitely closer, but I still wonder about tying it to the RDD.
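
Just to contrast the two shapes being discussed, a purely hypothetical sketch 
(none of these APIs exist; withResources, ResourceRequest, ResourceContext, 
and TensorFlowEstimator are made-up names, and records/runTensorFlow/trainingDF 
are placeholders like in the sketch above):

    // Shape A: attach the requirement to the RDD (the .withResources idea above).
    // `withResources` and `ResourceRequest` are hypothetical.
    val modelA = records
      .withResources(ResourceRequest(gpusPerTask = 1))
      .barrier()
      .mapPartitions(runTensorFlow)
      .collect()

    // Shape B: pass a resource context into the API call that launches the
    // work, so the RDD itself stays resource-agnostic. Again, made-up names.
    val modelB = new TensorFlowEstimator()
      .fit(trainingDF, ResourceContext(gpusPerTask = 1))

Shape B keeps the partitions resource-agnostic, but as noted above it gets 
tricky once the work is expressed in pure SQL rather than through an RDD or 
estimator API.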

cc [~irashid] [~mridulm80] who I think this has been brought up before with.

> Accelerator-aware task scheduling for Spark
> -------------------------------------------
>
>                 Key: SPARK-24615
>                 URL: https://issues.apache.org/jira/browse/SPARK-24615
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>            Priority: Major
>              Labels: Hydrogen, SPIP
>
> In the machine learning area, accelerator cards (GPU, FPGA, TPU) are 
> predominant compared to CPUs. To make the current Spark architecture work 
> with accelerator cards, Spark itself should understand the existence of 
> accelerators and know how to schedule tasks onto the executors where 
> accelerators are equipped.
> Spark's current scheduler schedules tasks based on the locality of the data 
> plus the availability of CPUs. This introduces some problems when scheduling 
> tasks that require accelerators.
>  # There are usually more CPU cores than accelerators on one node, so using 
> CPU cores to schedule accelerator-required tasks will introduce a mismatch.
>  # In a cluster, we can always assume that CPUs are equipped in each node, 
> but this is not true of accelerator cards.
>  # The existence of heterogeneous tasks (accelerator-required or not) 
> requires the scheduler to schedule tasks in a smart way.
> So here we propose to improve the current scheduler to support heterogeneous 
> tasks (accelerator-required or not). This can be part of the work of Project 
> Hydrogen.
> Details are attached in a Google doc. It doesn't cover all the implementation 
> details, just highlights the parts that should be changed.
>  
> CC [~yanboliang] [~merlintang]


