[ https://issues.apache.org/jira/browse/SPARK-26104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690725#comment-16690725 ]

Apache Spark commented on SPARK-26104:
--------------------------------------

User 'chenqin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23073

> make pci devices visible to task scheduler
> ------------------------------------------
>
>                 Key: SPARK-26104
>                 URL: https://issues.apache.org/jira/browse/SPARK-26104
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Chen Qin
>            Priority: Major
>              Labels: Hydrogen
>
> Spark task scheduling has long considered CPUs only: based on how many vcores 
> each executor has at a given moment, a task is scheduled as soon as enough 
> vcores become available.
> Moving to deep learning use cases, the fundamental computation and processing 
> unit shifts from the CPU alone to GPU/FPGA plus a CPU that moves data in and 
> out of GPU memory.
> Deep learning frameworks built on top of GPU fleets need a task pinned to a 
> fixed number of GPUs, which Spark does not support yet. E.g. a Horovod task 
> requires 2 GPUs, held uninterrupted until it finishes, regardless of CPU 
> availability in the executor. In Uber's Peloton executor scheduler, the number 
> of available cores can exceed what the user asked for, because executors may 
> be over-provisioned. Without definitive ownership of the PCI devices (/gpu1, 
> /gpu2), such workloads may end up in unexpected states.
>  
> Related JIRAs on allocating executor containers with GPU resources serve as 
> the bootstrap phase:
> SPARK-19320 (Mesos), SPARK-24491 (K8s), SPARK-20327 (YARN)
> This proposal is compatible with the existing SPIP, Accelerator Aware Task 
> Scheduling For Spark (SPARK-24615), but the approach is a bit different: it 
> tracks utilization of PCI devices, so a customized task scheduler can either 
> fall back to a "best effort" approach or implement the "must have" approach 
> stated above.
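>
> For illustration, a minimal sketch of what the "must have" mode could look 
> like from the application side. It assumes task-level GPU configs and a 
> TaskContext resources API along the lines proposed in SPARK-24615; the config 
> keys, discovery script path and API calls below are illustrative, not 
> something Spark 2.4 provides.
>
>   import org.apache.spark.TaskContext
>   import org.apache.spark.sql.SparkSession
>
>   object GpuMustHaveSketch {
>     def main(args: Array[String]): Unit = {
>       val spark = SparkSession.builder()
>         .appName("gpu-must-have-sketch")
>         // hypothetical accelerator-aware configs: each executor exposes 4 GPUs,
>         // and each task must hold 2 of them for its whole lifetime
>         .config("spark.executor.resource.gpu.amount", "4")
>         .config("spark.task.resource.gpu.amount", "2")
>         // executors would also need a script that reports their GPU addresses
>         .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
>         .getOrCreate()
>
>       val assigned = spark.sparkContext.parallelize(1 to 8, numSlices = 4).map { _ =>
>         // the task is launched only once 2 GPU addresses are free on the executor,
>         // and it sees exactly the PCI devices (e.g. /gpu1, /gpu2) assigned to it
>         TaskContext.get().resources()("gpu").addresses.mkString(",")
>       }.collect()
>
>       assigned.foreach(println)
>       spark.stop()
>     }
>   }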


