[
https://issues.apache.org/jira/browse/FLINK-39626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079273#comment-18079273
]
featzhang commented on FLINK-39626:
-----------------------------------
I would like to work on this sub-task under the FLINK-39625 umbrella. Could a
committer please assign it to me (Jira username: featzhang)? Thanks!
> Extend ResourceProfile to declare GPU resources on TaskManager
> --------------------------------------------------------------
>
> Key: FLINK-39626
> URL: https://issues.apache.org/jira/browse/FLINK-39626
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination, Runtime / Task
> Reporter: featzhang
> Priority: Major
> Labels: gpu, model-inference
>
> h2. Background
> Flink currently expresses slot requirements in {{ResourceProfile}} with CPU
> cores, managed memory, task heap memory, and a generic
> {{Map<String, Resource> extendedResources}}. The extended-resources map is
> intended for pluggable resources such as GPUs, but there is no first-party
> support for declaring, advertising, or matching GPU resources.
> This sub-task adds the concrete definitions and plumbing required so that
> subsequent sub-tasks can schedule operators that depend on a GPU sidecar.
> h2. Scope of this sub-task
> * Add a {{GPUResource}} subclass of {{Resource}} under
> {{flink-core}} or {{flink-runtime}}, carrying at least a logical GPU
> count.
> * Let TaskManagers advertise {{GPUResource}} in the resource profile they
> report to the ResourceManager, gated by a configuration option such as
> {{taskmanager.resources.gpu.count}}.
> * Ensure {{ResourceProfile#merge}}, {{#subtract}}, and
> {{#isMatching}} handle the new resource correctly.
> * No scheduling-policy change in this sub-task; scheduling with GPU
> affinity is covered in a separate sub-task.
> h2. Out of scope
> * No model loading, no RPC, no operator changes.
> * No vendor-specific attributes (device UUID, memory per device). Those can
> be added later in a backward-compatible way using the existing extended-
> resource mechanism.
> h2. Acceptance criteria
> * {{ResourceProfile}} round-trips correctly through serialization with a
> {{GPUResource}} set.
> * The TaskManager exposes the configured GPU count to the ResourceManager.
> * Unit tests cover {{merge}}, {{subtract}}, and {{isMatching}} interactions
> with {{GPUResource}}.
> * No regression in non-GPU cluster startup or existing resource tests.
> h2. Affected modules
> * {{flink-core}}
> * {{flink-runtime}}
> * {{flink-runtime-web}} (if the resource is surfaced in the dashboard in a
> follow-up)
> h2. Links
> Parent: FLINK-39625 (umbrella issue).
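For illustration only, the merge / subtract / isMatching semantics listed in the scope above might look roughly like the following. This is a minimal standalone sketch, not Flink code: GpuProfileSketch, Profile, and gpuCountFrom are hypothetical names, the profile is reduced to CPU cores plus a logical GPU count, and isMatching is assumed to mean "this profile can satisfy the required one". The configuration key is the one proposed in the issue and may change.

```java
import java.util.Map;

/** Hypothetical sketch; none of these types are Flink's actual classes. */
final class GpuProfileSketch {

    /** Config key proposed in the issue text (assumed, not final). */
    static final String GPU_COUNT_KEY = "taskmanager.resources.gpu.count";

    /** Reads the GPU count from a plain key/value config, defaulting to 0 (non-GPU cluster). */
    static int gpuCountFrom(Map<String, String> conf) {
        int count = Integer.parseInt(conf.getOrDefault(GPU_COUNT_KEY, "0").trim());
        if (count < 0) {
            throw new IllegalArgumentException("GPU count must be >= 0, got " + count);
        }
        return count;
    }

    /** Simplified stand-in for a resource profile: CPU cores plus a logical GPU count. */
    static final class Profile {
        final double cpuCores;
        final int gpuCount;

        Profile(double cpuCores, int gpuCount) {
            if (cpuCores < 0 || gpuCount < 0) {
                throw new IllegalArgumentException("resources must be non-negative");
            }
            this.cpuCores = cpuCores;
            this.gpuCount = gpuCount;
        }

        /** Component-wise sum of two profiles. */
        Profile merge(Profile other) {
            return new Profile(cpuCores + other.cpuCores, gpuCount + other.gpuCount);
        }

        /** Component-wise difference; rejects results that would go negative. */
        Profile subtract(Profile other) {
            if (other.cpuCores > cpuCores || other.gpuCount > gpuCount) {
                throw new IllegalArgumentException("cannot subtract a larger profile");
            }
            return new Profile(cpuCores - other.cpuCores, gpuCount - other.gpuCount);
        }

        /** True if this profile offers at least what the required profile asks for. */
        boolean isMatching(Profile required) {
            return cpuCores >= required.cpuCores && gpuCount >= required.gpuCount;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Profile)) {
                return false;
            }
            Profile p = (Profile) o;
            return Double.compare(cpuCores, p.cpuCores) == 0 && gpuCount == p.gpuCount;
        }

        @Override
        public int hashCode() {
            return 31 * Double.hashCode(cpuCores) + Integer.hashCode(gpuCount);
        }
    }

    public static void main(String[] args) {
        // A TaskManager configured with two logical GPUs and four CPU cores.
        int gpus = gpuCountFrom(Map.of(GPU_COUNT_KEY, "2"));
        Profile offered = new Profile(4.0, gpus);
        Profile required = new Profile(1.0, 1);

        System.out.println("matches: " + offered.isMatching(required));
        System.out.println("remaining GPUs: " + offered.subtract(required).gpuCount);
    }
}
```

In this sketch a missing key yields a GPU count of zero, so a non-GPU cluster behaves as before. Whether the real change should default this way, and whether the count lives in a dedicated field or in the extended-resources map, is left to the implementation.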
--
This message was sent by Atlassian Jira
(v8.20.10#820010)