[ 
https://issues.apache.org/jira/browse/FLINK-39626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079273#comment-18079273
 ] 

featzhang commented on FLINK-39626:
-----------------------------------

I would like to work on this sub-task under the FLINK-39625 umbrella. Could a 
committer please assign it to me (Jira username: featzhang)? Thanks!

> Extend ResourceProfile to declare GPU resources on TaskManager
> --------------------------------------------------------------
>
>                 Key: FLINK-39626
>                 URL: https://issues.apache.org/jira/browse/FLINK-39626
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination, Runtime / Task
>            Reporter: featzhang
>            Priority: Major
>              Labels: gpu, model-inference
>
> h2. Background
> Flink currently expresses slot requirements in {{ResourceProfile}} with CPU
> cores, managed memory, task heap memory, and a generic
> {{Map<String, Resource> extendedResources}}. The extended-resources map is
> intended for pluggable resources such as GPUs, but there is no first-party
> support for declaring, advertising, or matching GPU resources.
> This sub-task adds the concrete definitions and plumbing required so that
> subsequent sub-tasks can schedule operators that depend on a GPU sidecar.
> h2. Scope of this sub-task
> * Add a {{GPUResource}} subclass of {{Resource}} under
>  {{flink-core}} or {{flink-runtime}}, carrying at least a logical GPU
>  count.
> * Let TaskManagers advertise {{GPUResource}} in the resource profile they
>  report to ResourceManager, gated by a configuration option such as
>  {{taskmanager.resources.gpu.count}}.
> * Ensure {{ResourceProfile#merge}}, {{#subtract}}, and
>  {{#isMatching}} handle the new resource correctly.
> * No scheduling-policy change in this sub-task; scheduling with GPU
>  affinity is covered in a separate sub-task.
> h2. Out of scope
> * No model loading, no RPC, no operator changes.
> * No vendor-specific attributes (device UUID, memory per device). Those can
>  be added later in a backward-compatible way using the existing
>  extended-resource mechanism.
> h2. Acceptance criteria
> * {{ResourceProfile}} round-trips correctly through serialization with a
>  {{GPUResource}} set.
> * TaskManager exposes the configured GPU count to ResourceManager.
> * Unit tests cover {{merge}}, {{subtract}}, and {{isMatching}} interactions
>  with {{GPUResource}}.
> * No regression in non-GPU cluster startup or existing resource tests.
> h2. Affected modules
> * {{flink-core}}
> * {{flink-runtime}}
> * {{flink-runtime-web}} (if the resource is surfaced in the dashboard in a
>  follow-up)
> h2. Links
> Parent: FLINK-39625 (umbrella issue for this sub-task).
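
To make the {{merge}}, {{subtract}}, and {{isMatching}} expectations in the scope and acceptance criteria above concrete, here is a minimal standalone sketch. It does not use Flink's actual {{ResourceProfile}} or {{Resource}} classes: {{GpuProfileSketch}}, {{GPU_KEY}}, and {{withGpus}} are hypothetical names, and the matching rule (the offering profile must hold at least the required count of every named resource) is an assumption about the intended semantics, not a confirmed design.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Standalone illustration only. These are NOT Flink's ResourceProfile or
 * Resource classes; every name here is hypothetical.
 */
final class GpuProfileSketch {
    /** Hypothetical key under which the logical GPU count is stored. */
    static final String GPU_KEY = "gpu";

    /** Extended resources by name; values are logical counts. */
    private final Map<String, Long> extended;

    GpuProfileSketch(Map<String, Long> extended) {
        this.extended = Map.copyOf(extended);
    }

    /** Profile a TaskManager might advertise for a configured GPU count (0 means no GPU entry). */
    static GpuProfileSketch withGpus(long count) {
        if (count < 0) {
            throw new IllegalArgumentException("GPU count must be >= 0");
        }
        Map<String, Long> resources = new HashMap<>();
        if (count > 0) {
            resources.put(GPU_KEY, count);
        }
        return new GpuProfileSketch(resources);
    }

    long gpus() {
        return extended.getOrDefault(GPU_KEY, 0L);
    }

    /** Sums both profiles, resource by resource. */
    GpuProfileSketch merge(GpuProfileSketch other) {
        Map<String, Long> result = new HashMap<>(extended);
        other.extended.forEach((name, value) -> result.merge(name, value, Long::sum));
        return new GpuProfileSketch(result);
    }

    /** Removes other from this profile; fails if this profile holds less than other. */
    GpuProfileSketch subtract(GpuProfileSketch other) {
        Map<String, Long> result = new HashMap<>(extended);
        other.extended.forEach((name, value) -> {
            long remaining = result.getOrDefault(name, 0L) - value;
            if (remaining < 0) {
                throw new IllegalArgumentException("Cannot subtract more " + name + " than available");
            }
            if (remaining == 0) {
                result.remove(name);
            } else {
                result.put(name, remaining);
            }
        });
        return new GpuProfileSketch(result);
    }

    /** True if this profile offers at least every resource that required asks for. */
    boolean isMatching(GpuProfileSketch required) {
        return required.extended.entrySet().stream()
                .allMatch(e -> extended.getOrDefault(e.getKey(), 0L) >= e.getValue());
    }
}
```

In this sketch a profile built with {{withGpus(2)}} matches a request built with {{withGpus(1)}} but not the reverse, and a profile without a GPU entry still matches any request that asks for no GPU, which corresponds to the "no regression in non-GPU clusters" criterion.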



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
