featzhang created FLINK-39625:
---------------------------------

             Summary: Support GPU-based model inference via sidecar/actor pattern
                 Key: FLINK-39625
                 URL: https://issues.apache.org/jira/browse/FLINK-39625
             Project: Flink
          Issue Type: New Feature
          Components: Runtime / Coordination
            Reporter: featzhang


h2. Motivation

Real-time stream processing increasingly incorporates machine-learning inference
directly into streaming pipelines, for use cases such as feature enrichment,
anomaly detection, content ranking, and fraud detection. Today, users who
need GPU-accelerated model inference inside Flink typically load the model
directly inside the UDF or operator, which has several drawbacks:

* Every task slot holding the operator loads its own copy of the model into
 GPU memory, wasting VRAM and slot initialisation time.
* The model's lifecycle is coupled to Flink's task lifecycle: any restart,
 rescale, or failover causes model reload.
* Request batching must be implemented ad-hoc inside each UDF, and cross-task
 batching is impossible because each task only sees its own local requests.
* Flink has no first-class notion of GPU resources in ResourceProfile, so
 GPU assignment relies on external schedulers or manual pinning.

Inspired by the actor pattern used in frameworks such as Ray, this proposal
introduces a long-lived "GPU sidecar" process co-located with each
GPU-equipped TaskManager. Flink operators invoke the sidecar over an RPC
channel; the sidecar owns model loading, GPU memory management, and
cross-operator request batching.

h2. Goals

* Add a standardised way to declare GPU resources on TaskManager and request
 them through ResourceProfile.
* Provide a first-party GPU sidecar service that hosts one or more models on
 each GPU node, loads them once, and serves asynchronous inference
 requests.
* Allow Flink operators (both DataStream async I/O and Table/SQL functions)
 to invoke the sidecar efficiently, with bounded-queue back-pressure and
 cross-request batching (a batcher sketch follows this list).
* Schedule GPU-affinity operators close to a live sidecar via existing
 ResourceManager mechanisms.
* Expose metrics for GPU utilisation, queue depth, batch size, and
 per-request latency.
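
To make the back-pressure and batching goals concrete, the sketch below shows
one possible shape of the sidecar-side batcher. Everything here is
illustrative rather than a committed API: {{MicroBatcher}} and {{Pending}} are
hypothetical names; a bounded queue provides back-pressure to callers, and a
drain loop forms batches by size or timeout.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

public final class MicroBatcher {

    /** One pending request: input features plus the future to complete. */
    public static final class Pending {
        final float[] features;
        final CompletableFuture<float[]> result = new CompletableFuture<>();

        Pending(float[] features) {
            this.features = features;
        }
    }

    private final BlockingQueue<Pending> queue;
    private final int maxBatchSize;
    private final long maxWaitMillis;

    public MicroBatcher(int capacity, int maxBatchSize, long maxWaitMillis) {
        // Bounded queue: a full queue blocks submitters, i.e. back-pressure.
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.maxBatchSize = maxBatchSize;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Called by the RPC layer; blocks when the queue is full. */
    public CompletableFuture<float[]> submit(float[] features) throws InterruptedException {
        Pending p = new Pending(features);
        queue.put(p);
        return p.result;
    }

    /** Drain loop: collect up to maxBatchSize requests or wait maxWaitMillis. */
    public List<Pending> nextBatch() throws InterruptedException {
        List<Pending> batch = new ArrayList<>(maxBatchSize);
        batch.add(queue.take()); // block until at least one request arrives
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (batch.size() < maxBatchSize) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                break;
            }
            Pending next = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (next == null) {
                break;
            }
            batch.add(next);
        }
        return batch;
    }
}
{code}

Whatever thread drains the queue would run each batch as a single GPU pass and
complete the per-request futures; that is what allows requests from many tasks
to share one inference call.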

h2. Non-goals

* This proposal does not introduce a new model-training runtime; training
 remains the responsibility of external systems.
* It does not prescribe a specific inference engine (TensorRT, ONNX
 Runtime, PyTorch, and plain CUDA kernels are all acceptable backends).
* It does not replace the existing external-async patterns (Async I/O,
 Lookup Join); those remain valid for non-GPU remote calls.

h2. Proposed design overview

Three cooperating components:

# *TaskManager-side resource declaration* - extend {{ResourceProfile}} with a
 GPU dimension, advertised by TaskManagers that opt in.
# *GPU sidecar process* - a standalone service, one per GPU node, that loads
 models on startup, exposes an RPC endpoint (initial implementation: gRPC
 over UDS / TCP), and batches incoming inference requests.
# *Flink-side client* - an Async I/O based operator and a Table/SQL UDF that
 translate records into sidecar RPC calls, enforce ordering and timeouts,
 and surface metrics.

The sidecar is co-located with the TaskManager but runs in its own process,
so GPU memory is isolated from the JVM heap and loaded models can survive
TaskManager restarts where possible.
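
As a minimal sketch of the Flink-side client, the Async I/O integration could
look roughly like the following. The {{SidecarClient}} and
{{SidecarClientFactory}} interfaces are hypothetical stand-ins for the real
gRPC stub, which this issue does not yet define:

{code:java}
import java.io.Serializable;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

/** Illustrative Async I/O function that forwards records to the GPU sidecar. */
public class SidecarInferenceFunction extends RichAsyncFunction<float[], float[]> {

    /** Hypothetical non-blocking client over the sidecar's RPC endpoint. */
    public interface SidecarClient {
        CompletableFuture<float[]> infer(String modelName, float[] features);
    }

    /** Serializable factory so each parallel task builds its own client in open(). */
    public interface SidecarClientFactory extends Serializable {
        SidecarClient create();
    }

    private final String modelName;
    private final SidecarClientFactory clientFactory;
    private transient SidecarClient client;

    public SidecarInferenceFunction(String modelName, SidecarClientFactory clientFactory) {
        this.modelName = modelName;
        this.clientFactory = clientFactory;
    }

    @Override
    public void open(Configuration parameters) {
        // e.g. open a gRPC channel to the co-located sidecar over UDS or TCP.
        client = clientFactory.create();
    }

    @Override
    public void asyncInvoke(float[] features, ResultFuture<float[]> resultFuture) {
        // Non-blocking call; the sidecar batches requests across tasks.
        client.infer(modelName, features).whenComplete((result, error) -> {
            if (error != null) {
                resultFuture.completeExceptionally(error);
            } else {
                resultFuture.complete(Collections.singleton(result));
            }
        });
    }
}
{code}

Such a function would be wired in with {{AsyncDataStream.orderedWait(...)}}
(or {{unorderedWait}}) and a capacity bound, which supplies the bounded-queue
back-pressure listed under Goals.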

h2. Compatibility, deprecation, and migration plan

* Entirely additive; no changes to existing APIs.
* Feature-flagged via TaskManager configuration; clusters without GPUs are
 unaffected.
* The initial release is marked {{@Experimental}}.

h2. Testing plan

* Unit tests for the new ResourceProfile dimension and scheduling hints.
* End-to-end tests using a stubbed CPU "mock sidecar" so the pipeline can
 run on non-GPU CI agents (see the sketch after this list).
* Nightly GPU-enabled tests in a dedicated lane (tracked separately).
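
To illustrate the mock-sidecar approach, a CPU stand-in could simply implement
the hypothetical {{SidecarClient}} interface from the design sketch above and
answer every request immediately:

{code:java}
import java.util.concurrent.CompletableFuture;

/**
 * Illustrative CPU stand-in for the sidecar: completes every request
 * immediately with an identity "model", so pipelines that normally call
 * the GPU sidecar run unchanged on non-GPU CI agents.
 */
public class MockSidecarClient implements SidecarInferenceFunction.SidecarClient {
    @Override
    public CompletableFuture<float[]> infer(String modelName, float[] features) {
        return CompletableFuture.completedFuture(features.clone());
    }
}
{code}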

h2. Implementation plan

Work is split into six sequentially mergeable sub-tasks, each producing one
or two reviewable pull requests. The sub-tasks are linked from this
umbrella issue.

# Extend {{ResourceProfile}} to declare GPU resources on TaskManager (a
 hypothetical shape is sketched after this list).
# Introduce a new module {{flink-gpu-sidecar}} containing the service
 skeleton (process lifecycle, configuration, health checks, empty RPC
 surface).
# Implement asynchronous batched inference RPC inside the sidecar,
 including queue, batcher, and metrics.
# Add an Async I/O operator that invokes the sidecar from DataStream API.
# Route GPU-affinity operators to slots backed by a live sidecar via
 ResourceManager.
# Integrate with Table/SQL (UDF + DDL), ship an end-to-end example, and
 document user-facing configuration.
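
As a rough illustration of sub-task 1, the GPU dimension could surface in
{{ResourceProfile}} as sketched below. The {{setGpuCount}} builder method is
hypothetical, shown only to convey the intended shape; the concrete API is
left to that sub-task:

{code:java}
import org.apache.flink.runtime.clusterframework.types.ResourceProfile;

public class GpuProfileExample {

    // setGpuCount(...) is the new dimension proposed by sub-task 1 and does
    // not exist in current Flink; the other builder calls are existing API.
    static ResourceProfile gpuProfile() {
        return ResourceProfile.newBuilder()
                .setCpuCores(2.0)
                .setTaskHeapMemoryMB(1024)
                .setGpuCount(1) // proposed GPU dimension
                .build();
    }
}
{code}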

h2. Rejected alternatives

* *Loading the model directly inside the UDF.* Wastes GPU memory, prevents
 cross-task batching, couples model lifecycle to task lifecycle.
* *Running an external model-serving system (for example Triton) as the
 only integration point.* Still valuable and complementary, but does not
 give Flink a first-class notion of GPU resources or tight scheduling
 affinity; this proposal provides that foundation and is compatible with
 external servers as alternative backends.



