Hello GPU users,

We are currently considering deprecating the requirement that frameworks register with the GPU_RESOURCES capability in order to receive offers that contain GPUs. Going forward, we will recommend that users rely on Mesos's built-in `reservation` mechanism to achieve similar results.
Before deprecating it, we wanted to get a sense from the community of whether anyone is currently relying on this capability and would like to see it persist. If not, we will begin deprecating it in the next Mesos release and completely remove it in Mesos 2.0.

As background, the original motivation for this capability was to keep "legacy" frameworks from inadvertently scheduling jobs that don't require GPUs on GPU-capable machines and thus starving out other frameworks that legitimately want to place GPU jobs on those machines. The assumption here was that most machines in a cluster won't have GPUs installed on them, so some mechanism was necessary to keep legacy frameworks from scheduling jobs on the few machines that do. In essence, it provided an implicit reservation of GPU machines for "GPU aware" frameworks, bypassing the traditional `reservation` mechanism already built into Mesos. In such a setup, legacy frameworks would be free to schedule jobs on non-GPU machines, and "GPU aware" frameworks would be free to schedule GPU jobs on GPU machines and other types of jobs on other machines (or mix and match them however they please).

However, the problem comes when *all* machines in a cluster contain GPUs (or even if most of the machines in a cluster contain them). When this is the case, we have the opposite of the problem we were trying to solve by introducing the GPU_RESOURCES capability in the first place. We end up starving out jobs from legacy frameworks that *don't* require GPU resources because there are not enough machines available that don't have GPUs on them to service those jobs. We've actually seen this problem manifest in the wild at least once.

An alternative to completely deprecating the GPU_RESOURCES capability would be to add a new flag to the Mesos master called `--filter-gpu-resources`. When set to `true`, this flag would cause the Mesos master to continue to function as it does today.
That is, it would filter offers containing GPU resources and only send them to frameworks that opt into the GPU_RESOURCES framework capability. When set to `false`, this flag would cause the master to *not* filter offers containing GPU resources, and indiscriminately send them to all frameworks whether they set the GPU_RESOURCES capability or not. If anyone is currently relying on the capability, this flag would allow them to keep relying on it without disruption.

We'd prefer to deprecate the capability completely, but would consider adding this flag if people are currently relying on the GPU_RESOURCES capability and would like to see it persist.

We welcome any feedback you have.

Kevin + Ben
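
P.S. In case it helps the discussion, here is a rough sketch of the semantics we have in mind for the proposed flag. It is illustrative Python only, not the actual master code: `should_offer` and its parameter names are made up for this email, and the real decision would of course live inside the master.

```python
def should_offer(agent_has_gpus: bool,
                 framework_has_gpu_capability: bool,
                 filter_gpu_resources: bool = True) -> bool:
    """Hypothetical model of whether an offer may be sent to a framework.

    agent_has_gpus: the offer would contain GPU resources.
    framework_has_gpu_capability: the framework registered with GPU_RESOURCES.
    filter_gpu_resources: value of the proposed --filter-gpu-resources flag.
    """
    # Offers without GPUs are never affected by the capability or the flag.
    if not agent_has_gpus:
        return True
    # Flag set to false: GPU offers go to every framework, opted in or not.
    if not filter_gpu_resources:
        return True
    # Flag set to true (today's behavior): only opted-in frameworks see GPU offers.
    return framework_has_gpu_capability
```

With the flag at `true`, a legacy framework in an all-GPU cluster gets `False` for every offer, which is the starvation case described above. With the flag at `false`, it gets `True`.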