[ 
https://issues.apache.org/jira/browse/MESOS-7375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964402#comment-15964402
 ] 

Benjamin Bannier commented on MESOS-7375:
-----------------------------------------

The {{GPU_RESOURCES}} framework capability exists as a fix for clusters with a 
low number of GPU agents. If we unconditionally offered resources on GPU agents 
to frameworks not running any tasks using {{gpus}}, we might run out of the 
auxiliary resources that GPU tasks also need (e.g., {{cpus}} or {{disk}}). 
This could render these agents unusable even for frameworks that want to use 
{{gpus}} but also require auxiliary resources. 
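
To make the rule above concrete, here is a minimal sketch of the filtering 
behaviour as I described it. It is a toy model, not the allocator's actual 
code: the {{Agent}} and {{Framework}} classes, the {{may_offer}} helper and the 
agent and framework names are all made up for illustration.

```python
from dataclasses import dataclass, field

GPU_RESOURCES = "GPU_RESOURCES"


@dataclass
class Agent:
    name: str
    resources: dict  # e.g. {"cpus": 8, "disk": 100, "gpus": 2}


@dataclass
class Framework:
    name: str
    capabilities: set = field(default_factory=set)


def may_offer(agent, framework):
    """Toy version of the rule: resources on an agent with GPUs are only
    offered to frameworks that declared the GPU_RESOURCES capability."""
    has_gpus = agent.resources.get("gpus", 0) > 0
    return (not has_gpus) or (GPU_RESOURCES in framework.capabilities)


# Hypothetical cluster: one plain agent, one GPU agent.
agents = [
    Agent("cpu-1", {"cpus": 8, "disk": 100}),
    Agent("gpu-1", {"cpus": 8, "disk": 100, "gpus": 2}),
]
# Hypothetical frameworks: only "training" opted in.
frameworks = [Framework("web"), Framework("training", {GPU_RESOURCES})]

# Which agents each framework can receive offers from.
offers = {f.name: [a.name for a in agents if may_offer(a, f)] for f in frameworks}
print(offers)
```

In this model the framework without the capability never sees {{gpu-1}}, so 
its {{cpus}} and {{disk}} stay available for GPU tasks.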

In the other extreme, which you described here, our fix has adverse effects. 
When every agent has GPUs attached but no framework wants to run GPU tasks 
(i.e., when no framework declared {{GPU_RESOURCES}}), no offers will be made 
and all cluster resources will sit idle. The fix you proposed does address this 
extreme, but I think it then cannot guarantee that enough resources will be 
available in the case of a low number of GPU agents, so I am unsure how just 
adding such a flag would be enough to fix the issue for all (or even the 
majority of) possible setups.
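
A small simulation of both extremes, again as a toy model with made-up agent 
names and resource amounts (the helpers {{eligible_agents}}, {{fits}} and 
{{launch}} are hypothetical, not Mesos functions):

```python
GPU_RESOURCES = "GPU_RESOURCES"


def eligible_agents(agents, capabilities):
    """Names of agents whose resources may be offered to a framework
    with the given capabilities (toy version of the filtering rule)."""
    return [
        name for name, res in agents.items()
        if res.get("gpus", 0) == 0 or GPU_RESOURCES in capabilities
    ]


def fits(free, need):
    """True if every requested resource is still available."""
    return all(free.get(k, 0) >= v for k, v in need.items())


def launch(free, need):
    """Return the free resources left after launching one task."""
    return {k: v - need.get(k, 0) for k, v in free.items()}


# Extreme 1: every agent has GPUs, the framework did not opt in.
all_gpu = {"a1": {"cpus": 8, "gpus": 1}, "a2": {"cpus": 8, "gpus": 1}}
without_flag = eligible_agents(all_gpu, set())          # nothing is offered
with_flag = eligible_agents(all_gpu, {GPU_RESOURCES})   # the proposed flag helps

# Extreme 2: a single GPU agent; a CPU-only framework that opted in
# launches four 1-cpu tasks on it.
free = {"cpus": 4, "gpus": 2}
for _ in range(4):
    free = launch(free, {"cpus": 1})

# A GPU task also needs a cpu, so the GPUs are now stranded.
gpu_task_fits = fits(free, {"cpus": 1, "gpus": 1})
print(without_flag, with_flag, free, gpu_task_fits)
```

The flag resolves the first case, but in the second case the same opt-in lets 
a framework that never uses {{gpus}} consume the {{cpus}} a GPU task would need.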

It seems one of the deeper issues surfacing here is that the way our allocator 
takes topology into account is limited (only coarse-grained offers, wDRF taking 
only globally accumulated resources into account). At the same time, it is hard 
for schedulers to get a global picture without capturing, e.g., a lot of the 
state known to Mesos. An operator, on the other hand, already has knowledge of 
the eventually available resources in the cluster and their topology, so I 
wonder whether, where e.g., multirole is available, it would be possible for 
operators to make sure that sufficient auxiliary resources are available to 
make use of GPUs on agents, e.g., with reservations to dedicated roles.
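
As a rough sketch of what such an operator-side reservation could look like, 
the snippet below only builds the form body for a dynamic reservation request 
to the master's {{/reserve}} operator endpoint; it does not send anything. The 
JSON layout follows my reading of the reservation documentation and differs 
between Mesos versions, so please treat it as an assumption and check it 
against your release. The role {{gpu-workloads}}, the principal {{ops}}, the 
agent ID placeholder and the amounts are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode


def reserve_request_body(agent_id, role, principal, amounts):
    """Build a form-encoded body reserving scalar resources on one agent
    for a dedicated role (assumed layout, see the note above)."""
    resources = [
        {
            "name": name,
            "type": "SCALAR",
            "scalar": {"value": value},
            "reservations": [
                {"type": "DYNAMIC", "role": role, "principal": principal}
            ],
        }
        for name, value in amounts.items()
    ]
    return urlencode({"slaveId": agent_id, "resources": json.dumps(resources)})


# Hypothetical values: set aside cpus and disk next to the GPUs of one agent.
body = reserve_request_body(
    agent_id="<agent_id>",
    role="gpu-workloads",
    principal="ops",
    amounts={"cpus": 4, "disk": 1024},
)

# Decode again to inspect what would be sent.
decoded = json.loads(parse_qs(body)["resources"][0])
print(json.dumps(decoded, indent=2))
```

Frameworks that want to run GPU tasks would then subscribe to that role, so 
the reserved {{cpus}} and {{disk}} cannot be consumed by other workloads.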

> provide additional insight for framework developers re: GPU_RESOURCES 
> capability
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-7375
>                 URL: https://issues.apache.org/jira/browse/MESOS-7375
>             Project: Mesos
>          Issue Type: Documentation
>            Reporter: James DeFelice
>              Labels: mesosphere
>
> On clusters where all nodes are equal and every node has a GPU, frameworks 
> that **don't** opt-in to the `GPU_RESOURCES` capability won't get any offers. 
> This is surprising for operators.
> Even when a framework doesn't **need** GPU resources, it may make sense for a 
> framework scheduler to provide a `--gpu-cluster-compat` (or similar) flag 
> that results in the framework advertising the `GPU_RESOURCES` capability even 
> though it does not intend to consume any GPU. The effect being that said 
> framework will now receive offers on clusters where all nodes have GPU 
> resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
