[ https://issues.apache.org/jira/browse/SPARK-32429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165934#comment-17165934 ]

Thomas Graves commented on SPARK-32429:
---------------------------------------

So this doesn't address the task side, it addresses the executor side. The 
Worker has a discovery script that just returns an array of strings; as long as 
each address is something CUDA_VISIBLE_DEVICES understands, it's fine. I would 
expect this to be configurable so users could turn it on and off if needed. 
The Worker sets it to the GPUs it was going to assign to the executor anyway 
(previously passed in via the resources file). Within the executor, we still 
assign a specific GPU to a task when we launch it, and the task can call the 
CUDA API (cudaSetDevice) if it wishes to restrict itself to that device. 
Setting CUDA_VISIBLE_DEVICES ensures that a task doesn't use a GPU that was 
assigned to another executor.
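As a minimal sketch of the idea (hypothetical function name, not the actual Worker code): the Worker would fold the addresses from the discovery script into the executor's launch environment.

```python
import os


def executor_env(assigned_gpu_addresses, base_env=None):
    """Build the env for launching an executor, restricting visible GPUs.

    assigned_gpu_addresses: the addresses the Worker's discovery script
    returned for this executor, e.g. ["2", "3"] -- anything that
    CUDA_VISIBLE_DEVICES understands.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(assigned_gpu_addresses)
    return env


# The executor (and anything it spawns) can now only see GPUs 2 and 3.
print(executor_env(["2", "3"], base_env={})["CUDA_VISIBLE_DEVICES"])  # 2,3
```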

This is essentially what YARN and k8s do today when running in a Docker 
container: the container can only see the GPUs requested, but it's still up to 
application code to set the device per thread (task) if necessary.

Yeah, the Python side is perhaps more confusing in that respect, but my 
assumption was to just set CUDA_VISIBLE_DEVICES to be the same as for the JVM 
process. I think that would still allow you to reuse the Python processes for 
different tasks, and it leaves the same constraint on application code to 
explicitly handle per-task device selection. Again, I think this is the same as 
a Python process spawned inside a YARN/k8s Docker container with GPU isolation 
on. In general I would expect the GPU to really only be used by either the 
Python or the JVM process, but GPUs can handle both using it and context 
switching, as long as they don't together use all the memory.
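A sketch of the Python-side assumption (hypothetical, not the actual PySpark worker launch code): a child process spawned with the executor's environment simply inherits the same CUDA_VISIBLE_DEVICES restriction, so reusing that worker across tasks keeps the restriction in place.

```python
import os
import subprocess
import sys

# Assume the JVM executor was launched with CUDA_VISIBLE_DEVICES="0,1";
# spawn a Python worker with that same environment so it sees the same GPUs.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
out = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
    env=env, capture_output=True, text=True,
).stdout.strip()
print(out)  # 0,1
```

Per-task device selection inside that worker would still be up to application code, just as in the YARN/k8s container case.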

Here is a rough prototype for the java side:

[https://github.com/tgravescs/spark/commit/8f1a13d5eef82f81ef3c424a9a7b4b47903aab7b]

 

> Standalone Mode allow setting CUDA_VISIBLE_DEVICES on executor launch
> ---------------------------------------------------------------------
>
>                 Key: SPARK-32429
>                 URL: https://issues.apache.org/jira/browse/SPARK-32429
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> It would be nice if standalone mode could allow users to set 
> CUDA_VISIBLE_DEVICES before launching an executor.  This has multiple 
> benefits:
>  * It provides a kind of isolation, in that the executor can only see the 
> GPUs set there. 
>  * If your GPU application doesn't support explicitly setting the GPU device 
> id, setting this will make any GPU look like the default (id 0), and things 
> generally just work without any explicit configuration.
>  * New features are being added on newer GPUs that require explicit setting 
> of CUDA_VISIBLE_DEVICES like MIG 
> ([https://www.nvidia.com/en-us/technologies/multi-instance-gpu/])
> The code changes to just set this are very small; once we set it, we would 
> also possibly need to remap the GPU addresses, since CUDA renumbers them to 
> start from device id 0 again.
> The easiest implementation would specifically support just this: put it 
> behind a config, and set it when the config is on and GPU resources are 
> allocated. 
> Note we probably want to set this same thing when we launch a Python 
> process as well, so that it gets the same environment.
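The address remapping the issue mentions can be sketched as follows (hypothetical helper, not the actual patch): once CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible devices from 0, so the addresses handed to the executor must be rewritten to match.

```python
def remap_addresses(assigned):
    """Remap physical GPU addresses to the ids CUDA exposes after
    CUDA_VISIBLE_DEVICES is set.

    E.g. the Worker assigned physical GPUs ["2", "3"]; inside the executor
    they appear as devices 0 and 1.
    """
    return [str(i) for i in range(len(assigned))]


print(remap_addresses(["2", "3"]))  # ['0', '1']
```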



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
