featzhang created FLINK-39628:
---------------------------------
Summary: Implement asynchronous batched inference RPC in GPU
sidecar
Key: FLINK-39628
URL: https://issues.apache.org/jira/browse/FLINK-39628
Project: Flink
Issue Type: Sub-task
Components: Runtime / Task
Reporter: featzhang
h2. Background
With the sidecar process and its empty RPC surface in place, this sub-task
turns the sidecar into a real inference server: it accepts concurrent
inference requests, batches them to make full use of the GPU, and returns
results asynchronously.
h2. Scope of this sub-task
* Add an {{Infer}} RPC to the sidecar proto:
** Bidirectional streaming, so that the client can pipeline requests and
the server can interleave responses.
** Request carries opaque tensor bytes plus a request id; response carries
the same request id plus result tensor bytes or an error.
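A possible shape for this RPC, as a sketch only (service, message, and field names below are illustrative placeholders, not the final proto):

```protobuf
syntax = "proto3";

service GpuSidecar {
  // Bidirectional stream: the client pipelines requests and the server
  // interleaves responses, correlated by request_id.
  rpc Infer (stream InferRequest) returns (stream InferResponse);
}

message InferRequest {
  uint64 request_id = 1;
  bytes tensor = 2;      // opaque tensor bytes
}

message InferResponse {
  uint64 request_id = 1;
  oneof result {
    bytes tensor = 2;    // result tensor bytes
    string error = 3;
  }
}
```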
* Implement a bounded, backpressure-aware request queue inside the sidecar:
** Maximum queue length and maximum wait time are both configurable.
** Once the queue is full, the server returns a
{{RESOURCE_EXHAUSTED}}-equivalent status so the client can apply its
own backpressure.
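A minimal plain-Java sketch of such a bounded admission queue (class and status names are illustrative, not from this ticket):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative admission result, standing in for a gRPC status code. */
enum Admission { ACCEPTED, RESOURCE_EXHAUSTED }

/** Bounded queue that rejects rather than blocks once capacity and the wait budget are spent. */
final class RequestQueue<T> {
    private final BlockingQueue<T> queue;
    private final long maxWaitMillis;

    RequestQueue(int maxLength, long maxWaitMillis) {
        this.queue = new ArrayBlockingQueue<>(maxLength);  // configurable maximum queue length
        this.maxWaitMillis = maxWaitMillis;                // configurable maximum wait time
    }

    /** Try to enqueue; give up after maxWaitMillis so the caller can push back. */
    Admission offer(T request) throws InterruptedException {
        return queue.offer(request, maxWaitMillis, TimeUnit.MILLISECONDS)
                ? Admission.ACCEPTED
                : Admission.RESOURCE_EXHAUSTED;
    }

    T take() throws InterruptedException {
        return queue.take();
    }

    int depth() {
        return queue.size();
    }
}
```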
* Implement a batcher that aggregates queued requests by time window and
maximum batch size, then submits a single batched call to the inference
backend.
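The batching rule described above can be sketched in plain Java: block for the first request, then keep draining until the batch is full or the time window since the first request has elapsed (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Collects queued requests into a batch bounded by size and time window. */
final class Batcher<T> {
    private final BlockingQueue<T> queue;
    private final int maxBatchSize;
    private final long windowMillis;

    Batcher(BlockingQueue<T> queue, int maxBatchSize, long windowMillis) {
        this.queue = queue;
        this.maxBatchSize = maxBatchSize;
        this.windowMillis = windowMillis;
    }

    /** Blocks for the first element, then drains until the batch is full or the window expires. */
    List<T> nextBatch() throws InterruptedException {
        List<T> batch = new ArrayList<>(maxBatchSize);
        batch.add(queue.take());  // wait for the first request; it opens the window
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(windowMillis);
        while (batch.size() < maxBatchSize) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) break;                       // window already expired
            T next = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (next == null) break;                         // window expired while waiting
            batch.add(next);
        }
        return batch;  // submitted to the backend as one batched call
    }
}
```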
* Wire a pluggable backend interface so that the first concrete backend
(a mock / CPU stub for tests) can be replaced with TensorRT, ONNX
Runtime, or PyTorch in follow-up work.
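One way such a pluggable interface and its test stub could look (interface and method names are assumptions, not the committed API):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/**
 * Pluggable inference backend. A real TensorRT, ONNX Runtime, or PyTorch
 * implementation would replace the stub below in follow-up work.
 */
interface InferenceBackend {
    /** Runs one batched call; returns one result tensor per input tensor, in order. */
    CompletableFuture<List<byte[]>> inferBatch(List<byte[]> inputTensors);
}

/** Mock CPU stub for tests: echoes each input tensor back unchanged. */
final class EchoBackend implements InferenceBackend {
    @Override
    public CompletableFuture<List<byte[]>> inferBatch(List<byte[]> inputTensors) {
        return CompletableFuture.completedFuture(List.copyOf(inputTensors));
    }
}
```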
* Publish the following metrics through the existing Flink metrics
reporter abstraction:
** Queue depth.
** Batch size (histogram).
** Inference latency (histogram, end-to-end and per-stage).
** Inflight requests.
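As a rough sketch of what the sidecar would track, here are plain-Java stand-ins for those metrics; the real implementation would register gauges and histograms through Flink's metric reporter abstraction instead of this hand-rolled holder:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

/** Plain-Java stand-ins for the sidecar metrics (illustrative only). */
final class SidecarMetrics {
    final AtomicInteger queueDepth = new AtomicInteger();        // gauge
    final AtomicInteger inflightRequests = new AtomicInteger();  // gauge
    private final LongAdder batchCount = new LongAdder();
    private final LongAdder batchedRequests = new LongAdder();

    void recordBatch(int batchSize) {
        batchCount.increment();
        batchedRequests.add(batchSize);
    }

    /** Mean batch size; a real histogram would also expose quantiles. */
    double meanBatchSize() {
        long batches = batchCount.sum();
        return batches == 0 ? 0.0 : (double) batchedRequests.sum() / batches;
    }
}
```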
h2. Out of scope
* A specific model format (tracked with the concrete backend work).
* Authentication / authorisation on the RPC boundary (tracked separately).
h2. Acceptance criteria
* Throughput and latency benchmarks using the mock backend match the
documented expectations on a reference machine.
* Queue saturation returns a structured error rather than hanging.
* Metrics are visible via the in-process metric reporter and match the
counts observed at the client.
* No memory leak across a 30-minute soak test.
h2. Affected modules
* {{flink-gpu-sidecar}}
h2. Links
Parent: see umbrella issue linked to this sub-task.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)