[
https://issues.apache.org/jira/browse/FLINK-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16344388#comment-16344388
]
ASF GitHub Bot commented on FLINK-8431:
---------------------------------------
Github user eastcirclek commented on the issue:
https://github.com/apache/flink/pull/5307
@tillrohrmann
As you pointed out, the discussion we had on the mailing list was about the JM
not starting TMs on GPU-equipped agents. It turned out that a Mesos framework
needs to declare the `GPU_RESOURCES` capability if it wants to receive resource
offers that contain GPUs
[[link]](http://mesos.apache.org/documentation/latest/gpu-support/#framework-capabilities).
I managed to start TMs on the GPU-equipped agents by specifying the master flag
`--filter_gpu_resources` when starting the Mesos master.
[MESOS-7576](https://issues.apache.org/jira/browse/MESOS-7576) introduces
`--filter_gpu_resources` and, when the flag is set to false, Mesos frameworks
that do not have `GPU_RESOURCES` capability can receive offers that contain
GPUs from the Mesos master. So that problem turned out to be solvable without
modifying Flink.
The reason I created
[FLINK-8431](https://issues.apache.org/jira/browse/FLINK-8431) to allow
specifying the number of GPUs is that, when agents isolate GPUs, TMs that do
not request GPUs explicitly cannot see them, as shown in
[link](http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags).
Regarding your question,
> Is the original problem which we want to solve that Flink does not use
> agents which have GPU resources or that Flink cannot specify the number of GPUs
> it requires to run? It looks as if the PR solves the latter ...
Yes, the scope of FLINK-8431 and this PR is confined to the latter.
> but I was wondering whether we shouldn't solve the former problem.
I don't think we need to take care of the former anymore because
`GPU_RESOURCES` is going to be deprecated in favor of the reservation mechanism
as shown in
[link](https://www.mail-archive.com/[email protected]/msg37571.html) and
[MESOS-7576](https://issues.apache.org/jira/browse/MESOS-7576). Thus, we need
not split servers into two categories (CPU-only servers and GPU-equipped
servers) anymore. Nevertheless, we need to specify `GPU_RESOURCES` until it is
completely deprecated in Mesos-2.x. To this end, I add the `GPU_RESOURCES`
capability if the number of GPUs is greater than 0.
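
The capability rule described above can be sketched roughly as follows. This is
a self-contained illustration, not the code from the PR: the `Capability` enum
is a hypothetical stand-in for the Mesos framework capability type, and
`capabilitiesFor` and its `gpus` parameter are names assumed for the example.

```java
import java.util.EnumSet;
import java.util.Set;

final class GpuCapabilitySketch {

    // Hypothetical stand-in for the Mesos framework capability type;
    // only the one capability discussed here is modeled.
    enum Capability { GPU_RESOURCES }

    // Returns the capabilities a framework would declare for the given
    // number of GPUs requested per TaskManager (assumed to be an integer).
    static Set<Capability> capabilitiesFor(int gpus) {
        if (gpus < 0) {
            throw new IllegalArgumentException("gpus must not be negative: " + gpus);
        }
        Set<Capability> capabilities = EnumSet.noneOf(Capability.class);
        if (gpus > 0) {
            // Declare GPU_RESOURCES only when GPUs are actually requested,
            // so CPU-only deployments are unaffected.
            capabilities.add(Capability.GPU_RESOURCES);
        }
        return capabilities;
    }

    public static void main(String[] args) {
        System.out.println(capabilitiesFor(0)); // prints []
        System.out.println(capabilitiesFor(2)); // prints [GPU_RESOURCES]
    }
}
```

With no GPUs requested the framework declares nothing extra, so it keeps
behaving as before on clusters without GPUs.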
For those whose JM does not get offers that contain GPUs, I suggest restarting
the Mesos master with `--filter_gpu_resources` set to false, as explained
above.
> Allow to specify # GPUs for TaskManager in Mesos
> ------------------------------------------------
>
> Key: FLINK-8431
> URL: https://issues.apache.org/jira/browse/FLINK-8431
> Project: Flink
> Issue Type: Improvement
> Components: Cluster Management, Mesos
> Reporter: Dongwon Kim
> Assignee: Dongwon Kim
> Priority: Minor
>
> Mesos provides first-class support for Nvidia GPUs [1], but Flink does not
> exploit it when scheduling TaskManagers. If Mesos agents are configured to
> isolate GPUs as shown in [2], TaskManagers that do not specify to use GPUs
> cannot see GPUs at all.
> We, therefore, need to introduce a new configuration property named
> "mesos.resourcemanager.tasks.gpus" to allow users to specify # of GPUs for
> each TaskManager process in Mesos.
> [1] http://mesos.apache.org/documentation/latest/gpu-support/
> [2] http://mesos.apache.org/documentation/latest/gpu-support/#agent-flags