[
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964402#comment-15964402
]
Benjamin Bannier commented on MESOS-7375:
-----------------------------------------
The {{GPU_RESOURCES}} framework capability exists as a fix for clusters with a
low number of GPU agents. If we unconditionally offered resources on GPU agents
to frameworks not running any tasks using {{gpus}}, we might run out of
auxiliary resources also needed for GPU tasks (e.g., {{cpus}} or {{disk}}).
This might render these agents unusable even for frameworks that want to use
{{gpus}} (but also require auxiliary resources).
In the other extreme you described here, our fix has adverse effects. When every
agent has GPUs attached but no framework wants to run GPU tasks (i.e., when no
framework declared {{GPU_RESOURCES}}), no offers will be made and all cluster
resources will sit idle. The fix you proposed does address this extreme, but I
think it then cannot guarantee that enough resources will be available in the
case of a low number of GPU agents, so I am unsure how just adding such a flag
would be enough to fix the issue for all (or even the majority of) possible
setups.
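To make the two extremes concrete, here is a minimal sketch of the filtering
rule in Python. It is a simplified model under my own assumptions, not the
allocator's actual code; the names {{Agent}}, {{Framework}} and
{{offerable_agents}} are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

GPU_RESOURCES = "GPU_RESOURCES"


@dataclass
class Agent:
    name: str
    resources: dict  # e.g. {"cpus": 8, "gpus": 2}


@dataclass
class Framework:
    name: str
    capabilities: set = field(default_factory=set)


def offerable_agents(framework, agents):
    """Return the agents whose resources may be offered to `framework`.

    Simplified rule: an agent with GPUs is skipped entirely unless the
    framework declared the GPU_RESOURCES capability.
    """
    result = []
    for agent in agents:
        has_gpus = agent.resources.get("gpus", 0) > 0
        if has_gpus and GPU_RESOURCES not in framework.capabilities:
            continue
        result.append(agent)
    return result


# Hypothetical clusters: one with a single GPU agent, one with GPUs everywhere.
mixed = [Agent("a1", {"cpus": 8}), Agent("a2", {"cpus": 8, "gpus": 2})]
all_gpu = [Agent("g1", {"cpus": 8, "gpus": 2}), Agent("g2", {"cpus": 8, "gpus": 2})]

plain = Framework("plain")
gpu_aware = Framework("gpu-aware", {GPU_RESOURCES})

print([a.name for a in offerable_agents(plain, mixed)])      # only the non-GPU agent
print([a.name for a in offerable_agents(plain, all_gpu)])    # nothing is offered
print([a.name for a in offerable_agents(gpu_aware, all_gpu)])
```

In the mixed cluster the rule keeps the scarce GPU agent free for the
GPU-aware framework; in the all-GPU cluster the same rule leaves the plain
framework without any offers.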
It seems one of the deeper issues surfacing here is that the way our allocator
takes topology into account is limited (only coarse-grained offers, wDRF taking
only globally accumulated resources into account). At the same time, it is hard
for schedulers to get a global picture without capturing, e.g., a lot of the
state known to Mesos. An operator, on the other hand, already has knowledge of
the eventually available resources in the cluster and their topology, so I
wonder whether, where e.g. multirole is available, it would be possible for
operators to make sure that sufficient auxiliary resources are available to
make use of GPUs on agents, e.g., with reservations to dedicated roles.
> provide additional insight for framework developers re: GPU_RESOURCES
> capability
> --------------------------------------------------------------------------------
>
> Key: MESOS-7375
> URL: https://issues.apache.org/jira/browse/MESOS-7375
> Project: Mesos
> Issue Type: Documentation
> Reporter: James DeFelice
> Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers.
> This is surprising for operators.
> Even when a framework doesn't **need** GPU resources, it may make sense for a
> framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag
> that results in the framework advertising the `GPU_RESOURCES` capability even
> though it does not intend to consume any GPU. The effect being that said
> framework will now receive offers on clusters where all nodes have GPU
> resources.