[ 
https://issues.apache.org/jira/browse/FLINK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344388#comment-16344388
 ] 

ASF GitHub Bot commented on FLINK-8431:
---------------------------------------

Github user eastcirclek commented on the issue:

    https://github.com/apache/flink/pull/5307
  
    @tillrohrmann 
    
    As you pointed out, the discussion we had on the mailing list was about the JM 
not starting TMs on GPU-equipped agents. It turned out that a Mesos framework 
needs to specify the `GPU_RESOURCES` capability if it wants to get resource 
offers that contain GPUs 
[[link]](http://mesos.apache.org/documentation/latest/gpu-support/#framework-capabilities).
 I managed to start TMs on the GPU-equipped agents by setting the master flag 
`--filter_gpu_resources` to false when starting the Mesos master. 
[MESOS-7576](https://issues.apache.org/jira/browse/MESOS-7576) introduces 
`--filter_gpu_resources`; when the flag is set to false, Mesos frameworks 
that do not have the `GPU_RESOURCES` capability can receive offers that contain 
GPUs from the Mesos master. So that problem could be solved without 
modifying Flink. 
    
    The reason I created 
[FLINK-8431](https://issues.apache.org/jira/browse/FLINK-8431), which allows 
users to specify the number of GPUs, is that when GPUs are isolated, TMs cannot 
see them unless they request GPUs explicitly, as shown in 
[link](http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags).
    
    Regarding your question,
    > Is the original problem which we want to solve that Flink does not use 
agents which have GPU resources or that Flink cannot specify the number of GPUs 
it requires to run? It looks as if the PR solves the latter ...
    
    Yes, the scope of FLINK-8431 and this PR is confined to the latter.
    
    > but I was wondering whether we shouldn't solve the former problem.
    
    I don't think we need to take care of the former anymore because 
`GPU_RESOURCES` is going to be deprecated in favor of the reservation mechanism 
as shown in 
[link](https://www.mail-archive.com/[email protected]/msg37571.html) and 
[MESOS-7576](https://issues.apache.org/jira/browse/MESOS-7576). Thus, we no 
longer need to split servers into two categories (CPU-only servers and 
GPU-equipped servers). Nevertheless, we need to specify `GPU_RESOURCES` until it 
is completely deprecated in Mesos-2.x. To this end, I add a `GPU_RESOURCES` 
capability if the number of GPUs is greater than 0.
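
    The rule described above (advertise `GPU_RESOURCES` only when the configured 
GPU count is positive) could be sketched roughly as follows. This is an 
illustration, not the code in the PR: `GpuCapabilitySketch`, `capabilitiesFor`, 
and the nested `Capability` enum are hypothetical stand-ins for the Mesos 
framework capability type, and an integer GPU count is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

class GpuCapabilitySketch {

    // Hypothetical stand-in for the Mesos framework capability type;
    // the real Mesos classes are deliberately not used in this sketch.
    enum Capability { GPU_RESOURCES }

    // Returns the capabilities to advertise for a given per-TaskManager
    // GPU count (assumed to be a non-negative integer).
    static Set<Capability> capabilitiesFor(int gpus) {
        if (gpus < 0) {
            throw new IllegalArgumentException("gpus must be >= 0, got " + gpus);
        }
        Set<Capability> capabilities = EnumSet.noneOf(Capability.class);
        if (gpus > 0) {
            // Only opt in to GPU offers when GPUs are actually requested.
            capabilities.add(Capability.GPU_RESOURCES);
        }
        return capabilities;
    }

    public static void main(String[] args) {
        System.out.println("gpus=0 -> " + capabilitiesFor(0));
        System.out.println("gpus=2 -> " + capabilitiesFor(2));
    }
}
```

    With a GPU count of 0 the returned set stays empty, so a CPU-only deployment 
registers exactly as it did before.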
    
    For those whose JM does not get offers that contain GPUs, I suggest 
restarting the Mesos master with `--filter_gpu_resources` set to false, as 
explained above.



> Allow to specify # GPUs for TaskManager in Mesos
> ------------------------------------------------
>
>                 Key: FLINK-8431
>                 URL: https://issues.apache.org/jira/browse/FLINK-8431
>             Project: Flink
>          Issue Type: Improvement
>          Components: Cluster Management, Mesos
>            Reporter: Dongwon Kim
>            Assignee: Dongwon Kim
>            Priority: Minor
>
> Mesos provides first-class support for Nvidia GPUs [1], but Flink does not 
> exploit it when scheduling TaskManagers. If Mesos agents are configured to 
> isolate GPUs as shown in [2], TaskManagers that do not explicitly request GPUs 
> cannot see them at all.
> We therefore need to introduce a new configuration property named 
> "mesos.resourcemanager.tasks.gpus" to allow users to specify the number of 
> GPUs for each TaskManager process in Mesos.
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
