[ 
https://issues.apache.org/jira/browse/FLINK-39625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079282#comment-18079282
 ] 

Yaroslav Tkachenko commented on FLINK-39625:
--------------------------------------------

Hi, is there an accepted FLIP for this issue?

> Support GPU-based model inference via sidecar/actor pattern
> -----------------------------------------------------------
>
>                 Key: FLINK-39625
>                 URL: https://issues.apache.org/jira/browse/FLINK-39625
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Coordination
>            Reporter: featzhang
>            Priority: Major
>              Labels: flip-proposal, gpu, model-inference
>
> h2. Motivation
> Real-time stream processing increasingly incorporates machine-learning 
> inference
> directly into streaming pipelines, for use cases such as feature enrichment,
> anomaly detection, content ranking, and fraud detection. Today, users who
> need GPU-accelerated model inference inside Flink typically embed the model
> weights inside the UDF or operator, which has several drawbacks:
> * Every task slot holding the operator loads its own copy of the model into
>  GPU memory, wasting VRAM and slot initialisation time.
> * The model's lifecycle is coupled to Flink's task lifecycle: any restart,
>  rescale, or failover causes model reload.
> * Request batching must be implemented ad-hoc inside each UDF, and cross-task
>  batching is impossible because each task only sees its own local requests.
> * Flink has no first-class notion of GPU resources in ResourceProfile, so
>  GPU assignment relies on external schedulers or manual pinning.
> Inspired by the actor pattern used in frameworks such as Ray, this proposal
> introduces a long-lived "GPU sidecar" process, co-located with each GPU-
> equipped TaskManager. Flink operators invoke the sidecar over an RPC
> channel; the sidecar owns model loading, GPU memory management, and
> cross-operator request batching.
> h2. Goals
> * Add a standardised way to declare GPU resources on TaskManager and request
>  them through ResourceProfile.
> * Provide a first-party GPU sidecar service that hosts one or more models on
>  each GPU node, loads them once, and serves asynchronous inference
>  requests.
> * Allow Flink operators (both DataStream async I/O and Table/SQL functions)
>  to invoke the sidecar efficiently, with bounded-queue back-pressure and
>  cross-request batching.
> * Schedule GPU-affinity operators close to a live sidecar via existing
>  ResourceManager mechanisms.
> * Expose metrics for GPU utilisation, queue depth, batch size, and
>  per-request latency.
> h2. Non-goals
> * This proposal does not introduce a new model-training runtime; training
>  remains the responsibility of external systems.
> * It does not prescribe a specific inference engine (TensorRT, ONNX
>  Runtime, PyTorch, and plain CUDA kernels are all acceptable backends).
> * It does not replace the existing external-async patterns (Async I/O,
>  Lookup Join); those remain valid for non-GPU remote calls.
> h2. Proposed design overview
> Three cooperating components:
> # *TaskManager-side resource declaration* - extend {{ResourceProfile}} with a
>  GPU dimension, advertised by TaskManagers that opt in.
> # *GPU sidecar process* - a standalone service, one per GPU node, that loads
>  models on startup, exposes an RPC endpoint (initial implementation: gRPC
>  over UDS / TCP), and batches incoming inference requests.
> # *Flink-side client* - an Async I/O based operator and a Table/SQL UDF that
>  translate records into sidecar RPC calls, enforce ordering and timeouts,
>  and surface metrics.
> The sidecar is co-located with the TaskManager but runs in its own process
> so that GPU memory is isolated from JVM heap and survives TaskManager
> restarts when possible.
> h2. Compatibility, deprecation, and migration plan
> * Entirely additive. No existing API changes.
> * Feature-flagged via TaskManager configuration; clusters without GPUs are
>  unaffected.
> * The initial release is marked {{@Experimental}}.
> h2. Testing plan
> * Unit tests for the new ResourceProfile dimension and scheduling hints.
> * End-to-end tests using a stubbed CPU "mock sidecar" so the pipeline can
>  run on non-GPU CI agents.
> * Nightly GPU-enabled tests in a dedicated lane (tracked separately).
> h2. Implementation plan
> Work is split into six sequentially mergeable sub-tasks, each producing one
> or two reviewable pull requests. The sub-tasks are linked from this
> umbrella issue.
> # Extend {{ResourceProfile}} to declare GPU resources on TaskManager.
> # Introduce a new module {{flink-gpu-sidecar}} containing the service
>  skeleton (process lifecycle, configuration, health checks, empty RPC
>  surface).
> # Implement asynchronous batched inference RPC inside the sidecar,
>  including queue, batcher, and metrics.
> # Add an Async I/O operator that invokes the sidecar from DataStream API.
> # Route GPU-affinity operators to slots backed by a live sidecar via
>  ResourceManager.
> # Integrate with Table/SQL (UDF + DDL), ship an end-to-end example, and
>  document user-facing configuration.
> h2. Rejected alternatives
> * *Loading the model directly inside the UDF.* Wastes GPU memory, prevents
>  cross-task batching, couples model lifecycle to task lifecycle.
> * *Running an external model-serving system (for example Triton) as the
>  only integration point.* Still valuable and complementary, but does not
>  give Flink a first-class notion of GPU resources or tight scheduling
>  affinity; this proposal provides that foundation and is compatible with
>  external servers as alternative backends.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to