featzhang created FLINK-39630:
---------------------------------
Summary: Schedule GPU-affinity operators via ResourceManager
Key: FLINK-39630
URL: https://issues.apache.org/jira/browse/FLINK-39630
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: featzhang
h2. Background
The GPU sidecar is a per-node resource: on every {{TaskManager}} that hosts a
sidecar, the model is loaded once and all local operators are served through it.
For this to work efficiently, operators whose execution depends on the
sidecar must be scheduled onto slots backed by a node that actually runs
a live sidecar process.
This sub-task adds the scheduling hint and resource-matching logic, and
plugs them into the existing ResourceManager flow. It depends on the
{{GPUResource}} work already completed in the resource-profile sub-task.
h2. Scope of this sub-task
* Mark the GPU client operator from the async-operator sub-task with a
{{ResourceSpec}} containing a {{GPUResource}} requirement.
* Extend the slot matcher so that slots advertised by non-GPU
TaskManagers are rejected for such operators.
* Add a lightweight liveness probe in {{ResourceManager}} that checks the
sidecar's {{/health}} endpoint before a slot is handed out; slots whose
sidecar is not ready are temporarily withheld.
* Expose a metric that counts slot rejections caused by a failed sidecar
liveness check, to aid diagnosis.
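The matching rules in this list can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the proposed implementation: {{SlotOffer}}, {{GpuSlotMatcher}}, the {{sidecarReady}} predicate and the counter are hypothetical names invented for the sketch, and none of them are existing Flink classes or APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Predicate;

/** Hypothetical stand-in for a slot advertised by a TaskManager; not a Flink class. */
final class SlotOffer {
    final String taskManagerId;
    final boolean hasGpu;

    SlotOffer(String taskManagerId, boolean hasGpu) {
        this.taskManagerId = taskManagerId;
        this.hasGpu = hasGpu;
    }
}

/** Sketch of GPU-aware slot filtering with a liveness gate and a rejection counter. */
final class GpuSlotMatcher {
    // Assumed probe: returns true when the sidecar on the given TaskManager is ready.
    private final Predicate<String> sidecarReady;
    // Stand-in for the proposed metric; a real change would register it with Flink's metric system.
    private final AtomicLong livenessRejections = new AtomicLong();

    GpuSlotMatcher(Predicate<String> sidecarReady) {
        this.sidecarReady = sidecarReady;
    }

    /** Returns the offers that may host an operator with the given GPU requirement. */
    List<SlotOffer> match(boolean requiresGpu, List<SlotOffer> offers) {
        List<SlotOffer> accepted = new ArrayList<>();
        for (SlotOffer offer : offers) {
            if (!requiresGpu) {
                accepted.add(offer); // operators without a GPU requirement are unaffected
                continue;
            }
            if (!offer.hasGpu) {
                continue; // reject: slot comes from a non-GPU TaskManager
            }
            if (!sidecarReady.test(offer.taskManagerId)) {
                livenessRejections.incrementAndGet(); // withhold: sidecar not ready
                continue;
            }
            accepted.add(offer);
        }
        return accepted;
    }

    /** Number of GPU slots withheld so far because the sidecar was not ready. */
    long livenessRejectionCount() {
        return livenessRejections.get();
    }
}
```

In this sketch only liveness failures increment the counter; slots from non-GPU TaskManagers are rejected silently, which is one possible reading of the metric bullet above.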
h2. Out of scope
* Global GPU placement across multiple clusters.
* Re-scheduling on model-weight hot reload (the sidecar handles that
internally).
h2. Acceptance criteria
* Unit tests covering the slot matcher with mixed GPU and non-GPU
TaskManagers.
* Integration test: deploying the async operator on a two-node standalone
cluster (one GPU node with mock sidecar, one plain node) schedules all
subtasks onto the GPU node.
* Liveness probe failures are reflected in the new metric and in logs.
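For the mock sidecar mentioned in the integration-test criterion, a minimal sketch could look like the one below. It uses only JDK classes ({{HttpURLConnection}} and the built-in {{com.sun.net.httpserver.HttpServer}}). The class names {{SidecarHealthProbe}} and {{MockSidecar}}, the timeout handling, and the choice of HTTP 200 as the "ready" signal are assumptions made for illustration; the issue only specifies that a {{/health}} endpoint is checked.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

/** Hypothetical probe: treats HTTP 200 on /health as "sidecar ready" (an assumption). */
final class SidecarHealthProbe {
    private final int timeoutMillis;

    SidecarHealthProbe(int timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns true only if GET {baseUrl}/health answers with HTTP 200 within the timeout. */
    boolean isReady(String baseUrl) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(baseUrl + "/health").toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            return conn.getResponseCode() == 200;
        } catch (IOException e) {
            return false; // unreachable or timed out: treat as not ready
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }
}

/** Hypothetical mock sidecar for tests: answers /health with a fixed status code. */
final class MockSidecar implements AutoCloseable {
    private final HttpServer server;

    MockSidecar(int status) throws IOException {
        // Port 0 lets the OS pick a free port on the loopback interface.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/health", exchange -> {
            exchange.sendResponseHeaders(status, -1); // -1 means no response body
            exchange.close();
        });
        server.start();
    }

    String baseUrl() {
        return "http://127.0.0.1:" + server.getAddress().getPort();
    }

    @Override
    public void close() {
        server.stop(0);
    }
}
```

A test could start one {{MockSidecar}} with status 200 and another with 503, then check that the probe reports only the first as ready.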
h2. Affected modules
* {{flink-runtime}}
* {{flink-runtime-web}} (surface the new metric)
h2. Links
Parent: see umbrella issue linked to this sub-task.