Hello GPU users,
We are currently considering deprecating the requirement that frameworks
register with the GPU_RESOURCES capability in order to receive offers that
contain GPUs. Going forward, we will recommend that users rely on Mesos's
built-in `reservation` mechanism to achieve similar results.
Before deprecating it, we wanted to get a sense from the community of whether
anyone is currently relying on this capability and would like to see it
persist. If not, we will begin deprecating it in the next Mesos release and
completely remove it in Mesos 2.0.
As background, the original motivation for this capability was to keep
“legacy” frameworks from inadvertently scheduling jobs that don’t require
GPUs on GPU-capable machines and thus starving out other frameworks that
legitimately want to place GPU jobs on those machines. The assumption here
was that most machines in a cluster won't have GPUs installed on them, so
some mechanism was necessary to keep legacy frameworks from scheduling jobs
on those machines. In essence, it provided an implicit reservation of GPU
machines for "GPU aware" frameworks, bypassing the traditional
`reservation` mechanism already built into Mesos.
In such a setup, legacy frameworks would be free to schedule jobs on
non-GPU machines, and "GPU aware" frameworks would be free to schedule GPU
jobs on GPU machines and other types of jobs on other machines (or mix and
match them however they please).
However, the problem comes when *all* machines in a cluster contain GPUs
(or even if most of the machines in a cluster contain them). When this is
the case, we have the opposite of the problem we were trying to solve by
introducing the GPU_RESOURCES capability in the first place. We end up
starving out jobs from legacy frameworks that *don’t* require GPU resources
because there are not enough machines available that don’t have GPUs on
them to service those jobs. We've actually seen this problem manifest in
the wild at least once.
An alternative to completely deprecating the GPU_RESOURCES capability would
be to add a new flag to the Mesos master called `--filter-gpu-resources`. When
set to `true`, this flag would cause the Mesos master to continue to
function as it does today. That is, it would filter offers containing GPU
resources and only send them to frameworks that opt into the GPU_RESOURCES
framework capability. When set to `false`, this flag would cause the master
to *not* filter offers containing GPU resources, and indiscriminately send
them to all frameworks whether they set the GPU_RESOURCES capability or not.
For anyone currently relying on the capability, this flag would allow them
to keep relying on it without disruption.
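
To make the proposed semantics concrete, here is a rough Python sketch of the filtering decision. It is only a toy model under stated assumptions, not Mesos code: `Agent`, `Framework`, and `offerable_agents` are hypothetical names made up for illustration, and the model assumes filtering happens per machine (a machine with any GPUs is hidden entirely from frameworks without the capability).

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    # Hypothetical stand-in for a machine in the cluster.
    name: str
    gpus: int = 0


@dataclass
class Framework:
    # Hypothetical stand-in for a registered framework.
    name: str
    capabilities: set = field(default_factory=set)


def offerable_agents(framework, agents, filter_gpu_resources=True):
    """Return the agents whose resources may be offered to `framework`.

    filter_gpu_resources=True models today's behavior: GPU machines are
    only offered to frameworks that opted into GPU_RESOURCES.
    filter_gpu_resources=False models the proposed opt-out: every
    framework sees every machine.
    """
    if not filter_gpu_resources or "GPU_RESOURCES" in framework.capabilities:
        return list(agents)
    return [a for a in agents if a.gpus == 0]


# A cluster in which *all* machines have GPUs (the problematic case).
agents = [Agent("agent-1", gpus=4), Agent("agent-2", gpus=2)]
legacy = Framework("legacy")
gpu_aware = Framework("gpu-aware", {"GPU_RESOURCES"})

for flag in (True, False):
    for fw in (legacy, gpu_aware):
        names = [a.name for a in offerable_agents(fw, agents, flag)]
        print(f"filter={flag} {fw.name}: {names}")
```

With filtering on, the legacy framework in this all-GPU cluster is offered nothing, which is the starvation case described above; with filtering off, both frameworks see both machines.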
We'd prefer to deprecate the capability completely, but would consider
adding this flag if people are currently relying on the GPU_RESOURCES
capability and would like to see it persist.
We welcome any feedback you have.
Kevin + Ben