featzhang created FLINK-39625:
---------------------------------
Summary: Support GPU-based model inference via sidecar/actor pattern
Key: FLINK-39625
URL: https://issues.apache.org/jira/browse/FLINK-39625
Project: Flink
Issue Type: New Feature
Components: Runtime / Coordination
Reporter: featzhang
h2. Motivation
Real-time stream processing increasingly incorporates machine-learning inference
directly into streaming pipelines, for use cases such as feature enrichment,
anomaly detection, content ranking, and fraud detection. Today, users who
need GPU-accelerated model inference inside Flink typically load the model
directly inside the UDF or operator, which has several drawbacks:
* Every task slot holding the operator loads its own copy of the model into
GPU memory, wasting VRAM and slot initialisation time.
* The model's lifecycle is coupled to Flink's task lifecycle: any restart,
rescale, or failover causes model reload.
* Request batching must be implemented ad hoc inside each UDF, and cross-task
batching is impossible because each task only sees its own local requests.
* Flink has no first-class notion of GPU resources in ResourceProfile, so
GPU assignment relies on external schedulers or manual pinning.
Inspired by the actor pattern used in frameworks such as Ray, this proposal
introduces a long-lived "GPU sidecar" process, co-located with each GPU-
equipped TaskManager. Flink operators invoke the sidecar over an RPC
channel; the sidecar owns model loading, GPU memory management, and
cross-operator request batching.
h2. Goals
* Add a standardised way to declare GPU resources on TaskManager and request
them through ResourceProfile.
* Provide a first-party GPU sidecar service that hosts one or more models on
each GPU node, loads them once, and serves asynchronous inference
requests.
* Allow Flink operators (both DataStream async I/O and Table/SQL functions)
to invoke the sidecar efficiently, with bounded-queue back-pressure and
cross-request batching.
* Schedule GPU-affinity operators close to a live sidecar via existing
ResourceManager mechanisms.
* Expose metrics for GPU utilisation, queue depth, batch size, and
per-request latency.
h2. Non-goals
* This proposal does not introduce a new model-training runtime; training
remains the responsibility of external systems.
* It does not prescribe a specific inference engine (TensorRT, ONNX
Runtime, PyTorch, and plain CUDA kernels are all acceptable backends).
* It does not replace the existing external-async patterns (Async I/O,
Lookup Join); those remain valid for non-GPU remote calls.
h2. Proposed design overview
Three cooperating components:
# *TaskManager-side resource declaration* - extend {{ResourceProfile}} with a
GPU dimension, advertised by TaskManagers that opt in.
# *GPU sidecar process* - a standalone service, one per GPU node, that loads
models on startup, exposes an RPC endpoint (initial implementation: gRPC
over UDS / TCP), and batches incoming inference requests.
# *Flink-side client* - an Async I/O based operator and a Table/SQL UDF that
translate records into sidecar RPC calls, enforce ordering and timeouts,
and surface metrics.
The sidecar is co-located with the TaskManager but runs in its own process,
so GPU memory is isolated from the JVM heap and loaded models can survive
TaskManager restarts.
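To make the bounded-queue back-pressure and size/timeout batching concrete, here is a minimal, self-contained sketch of the kind of micro-batcher the sidecar could run. {{MicroBatcher}} and all of its names and parameters are illustrative only, not part of the proposed API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative micro-batcher: a bounded request queue that the serving
 * loop drains in batches of up to maxBatch, waiting at most timeoutMs
 * for the first request of each batch.
 */
public class MicroBatcher {
    private final BlockingQueue<String> queue;
    private final int maxBatch;
    private final long timeoutMs;

    public MicroBatcher(int capacity, int maxBatch, long timeoutMs) {
        // Bounded queue: a full queue rejects submissions, which is how
        // back-pressure would propagate to the Flink-side client.
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatch = maxBatch;
        this.timeoutMs = timeoutMs;
    }

    /** Returns false when the queue is full, signalling back-pressure. */
    public boolean submit(String request) {
        return queue.offer(request);
    }

    /**
     * Blocks up to timeoutMs for the first request, then drains whatever
     * else is already queued, up to maxBatch in total. An empty list means
     * the timeout elapsed with no pending work.
     */
    public List<String> nextBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>();
        String first = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        if (first == null) {
            return batch;
        }
        batch.add(first);
        queue.drainTo(batch, maxBatch - 1);
        return batch;
    }

    public static void main(String[] args) throws Exception {
        MicroBatcher batcher = new MicroBatcher(16, 4, 10);
        for (int i = 0; i < 6; i++) {
            batcher.submit("req-" + i);
        }
        System.out.println(batcher.nextBatch().size()); // first batch: maxBatch = 4
        System.out.println(batcher.nextBatch().size()); // remaining 2 requests
    }
}
```

A real sidecar would additionally group requests per model and copy each batch to the GPU in one transfer; the sketch only shows the queue/batch mechanics.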
h2. Compatibility, deprecation, and migration plan
* Entirely additive. No existing API changes.
* Feature-flagged via TaskManager configuration; clusters without GPUs are
unaffected.
* The initial release is marked {{@Experimental}}.
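As an illustration of how the feature flag could look: Flink's external resource framework (FLIP-108) already lets a TaskManager advertise GPUs, and the ResourceProfile work here would build on it. The first block below uses the documented external-resource keys (see the Flink external resources docs for the authoritative names); the sidecar keys in the second block are hypothetical placeholders invented for this sketch, not final configuration names.

```yaml
# Existing external-resource keys (GPU plugin shipped with Flink):
external-resources: gpu
external-resource.gpu.amount: 1
external-resource.gpu.driver-factory.class: org.apache.flink.externalresource.gpu.GPUDriverFactory

# Hypothetical sidecar flag for this proposal (names illustrative only);
# clusters that leave it unset are unaffected:
taskmanager.gpu-sidecar.enabled: true
taskmanager.gpu-sidecar.endpoint: unix:///tmp/flink-gpu-sidecar.sock
```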
h2. Testing plan
* Unit tests for the new ResourceProfile dimension and scheduling hints.
* End-to-end tests using a stubbed CPU "mock sidecar" so the pipeline can
run on non-GPU CI agents.
* Nightly GPU-enabled tests in a dedicated lane (tracked separately).
h2. Implementation plan
Work is split into six sequentially mergeable sub-tasks, each producing one
or two reviewable pull requests. The sub-tasks are linked from this
umbrella issue.
# Extend {{ResourceProfile}} to declare GPU resources on TaskManager.
# Introduce a new module {{flink-gpu-sidecar}} containing the service
skeleton (process lifecycle, configuration, health checks, empty RPC
surface).
# Implement asynchronous batched inference RPC inside the sidecar,
including queue, batcher, and metrics.
# Add an Async I/O operator that invokes the sidecar from DataStream API.
# Route GPU-affinity operators to slots backed by a live sidecar via
ResourceManager.
# Integrate with Table/SQL (UDF + DDL), ship an end-to-end example, and
document user-facing configuration.
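For sub-task 6, the user-facing surface might look like the DDL and query below. The function name, class name, and model identifier are illustrative only; the actual names would be decided during that sub-task.

```sql
-- Hypothetical registration of a sidecar-backed inference UDF
-- (class name illustrative, not an existing Flink class):
CREATE FUNCTION gpu_infer
  AS 'org.apache.flink.gpu.sidecar.table.SidecarInferenceFunction';

-- Enrich a stream with scores served by the co-located GPU sidecar:
SELECT txn_id, gpu_infer('fraud-model-v3', features) AS score
FROM transactions;
```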
h2. Rejected alternatives
* *Loading the model directly inside the UDF.* Wastes GPU memory, prevents
cross-task batching, couples model lifecycle to task lifecycle.
* *Running an external model-serving system (for example Triton) as the
only integration point.* Still valuable and complementary, but does not
give Flink a first-class notion of GPU resources or tight scheduling
affinity; this proposal provides that foundation and is compatible with
external servers as alternative backends.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)