Re: [DISCUSS] FLIP: Streaming-native AI Inference Runtime for Flink —
Additional Context and Implementation Mapping


Hi Flink devs,

Thanks for the initial feedback and discussions on the proposed FLIP.

To further clarify the scope and reduce ambiguity, I would like to provide
an *additional consolidated view of existing implementations, abstraction
design, and anticipated questions*.

This follow-up aims to clarify that this proposal is based on *existing
scattered implementations across Flink*, and primarily focuses on unifying
them into a coherent runtime abstraction.
------------------------------
1. Summary of Existing Implementations

The following areas already exist in the codebase and form the foundation
of this proposal:
1.1 SQL / Table API inference entry points

   - AsyncBatch-based inference integration for SQL/Table API
   - Python DataStream API support for async inference workflows

👉 These provide *user-facing entry points for inference execution*, but
are function-level abstractions.
------------------------------
1.2 Streaming inference execution primitives

   - AsyncBatchWaitOperator for asynchronous execution
   - Retry and timeout handling strategies for inference workloads
   - Sequence-based batching consistency mechanisms

👉 These represent *low-level execution building blocks*, but are not
unified under a single runtime abstraction.
------------------------------
1.3 Reliability and fault tolerance mechanisms

   - Retry + fallback for inference failures
   - Circuit breaker for external inference services
   - Connection pooling for HTTP-based inference backends

👉 These are *production-grade reliability patterns*, currently implemented
in a distributed manner across components.
------------------------------
1.4 External inference backend integration

   - Triton inference integration
   - HTTP-based inference services
   - Model-serving backend abstractions (partial)

👉 These enable external model serving integration, but lack a unified
backend abstraction layer.
------------------------------
1.5 Observability and diagnostics

   - Metrics for async inference execution
   - EXPLAIN plan enhancements (cost, rowcount, watermark visibility)
   - Web UI improvements for runtime diagnostics

------------------------------
2. Proposed Abstraction Layer (Refined)

Based on existing implementations, the proposal is to introduce a *unified
Streaming Inference Runtime layer*.
2.1 Core abstraction: InferenceOperator

interface InferenceOperator<IN, OUT> {

    void open(Configuration config);

    CompletableFuture<List<OUT>> asyncInvoke(List<IN> batch);

    void close();
}

Key design goals:

   - batch-oriented execution
   - async-first model
   - backpressure-aware scheduling
   - backend-agnostic execution

------------------------------
2.2 Inference Execution Engine

A runtime component responsible for:

   - Adaptive batching (size + time-based)
   - Backpressure-aware scheduling
   - Retry / timeout / circuit breaker orchestration
   - Concurrency control per operator instance
   - Metrics aggregation

------------------------------
2.3 Pluggable backend model

interface InferenceBackend {
    CompletableFuture<Response> invoke(Request request);
}

Supported backends:

   - HTTP / REST services
   - NVIDIA Triton inference server
   - Custom inference systems (LLM APIs, internal services)

------------------------------
3. Design Principle Clarification

To clarify scope:
This FLIP does NOT:

   - Replace ML_PREDICT
   - Introduce a new ML framework
   - Bind to a specific inference engine

This FLIP DOES:

   - Unify existing inference-related runtime capabilities
   - Introduce a standard execution abstraction
   - Provide production-grade inference runtime semantics

------------------------------
4. Mapping: Issues + PRs → FLIP Layers

(See also previous appendix; summarised here for clarity)
Layer Existing Implementations
SQL/API AsyncBatchFunction, Table API integration
Runtime AsyncBatchWaitOperator, retry/timeout logic
Execution Engine batching, fallback, circuit breaker
Backend Integration Triton, HTTP inference services
Observability metrics, EXPLAIN enhancements, UI diagnostics
------------------------------
5. FAQ / Clarifications Q1: Is this duplicating ML_PREDICT?

No. ML_PREDICT is a *SQL-level semantic abstraction*, while this proposal
focuses on *runtime execution layer abstraction*.

They operate at different layers and are complementary.
------------------------------
Q2: Why not keep using AsyncFunction / Async I/O?

AsyncFunction provides flexibility but lacks:

   - standardized batching semantics
   - unified retry/fallback policies
   - runtime-level scheduling control
   - backend abstraction consistency

This leads to duplicated implementations across users.
------------------------------
Q3: Is this tightly coupled to Triton?

No. Triton is only one supported backend.

The design explicitly supports multiple inference systems (HTTP, LLM APIs,
custom services).
------------------------------
Q4: Does this introduce a new execution engine in Flink?

No.

This is:

   - a new operator abstraction layer
   - built on top of existing streaming runtime
   - fully compatible with checkpointing and scheduling

------------------------------
Q5: Why is this needed now?

With increasing adoption of Flink for:

   - streaming inference
   - LLM-based pipelines
   - hybrid data + AI workflows

there is a growing need for a *standardized inference runtime abstraction*,
rather than fragmented user implementations.
------------------------------
6. Closing

Appreciate any feedback on:

   - scope of runtime abstraction
   - operator vs SQL integration boundary
   - required minimal feature set for initial version

If there is alignment, I will follow up with a more detailed FLIP design
document with concrete API evolution and runtime integration plan.

Thanks,
featzhang


FeatZhang <[email protected]> 于2026年4月19日周日 00:09写道:

> Hi Flink devs,
>
> I would like to start a discussion on a missing piece in Flink’s current
> AI/ML inference capabilities and propose a FLIP for a *streaming-native
> AI inference runtime layer*.
> Motivation
>
> Apache Flink currently provides basic AI inference capabilities through
> SQL-level constructs such as ML_PREDICT and related functions. These are
> useful for integrating external models into batch and streaming pipelines.
>
> However, in production AI workloads (especially real-time inference and
> LLM serving), we observe several gaps:
>
>    - No unified runtime abstraction for inference execution
>    - No streaming-native batching or latency-aware scheduling
>    - Limited support for backpressure-aware inference control
>    - No built-in retry, fallback, or circuit breaker mechanisms
>    - Fragmented integration with external inference systems (e.g., HTTP
>    services, Triton, LLM endpoints)
>
> As a result, users often re-implement these capabilities in user-defined
> functions, leading to inconsistent behavior and duplicated complexity.
> ------------------------------
> Proposal (High-level)
>
> This FLIP proposes introducing a *Streaming-native AI Inference Runtime
> Layer* in Flink, providing:
>
>    - A unified inference operator abstraction
>    - Adaptive batching and concurrency control
>    - Backpressure-aware request scheduling
>    - Pluggable inference backends (HTTP / Triton / custom services)
>    - Built-in reliability mechanisms (retry, timeout, circuit breaker)
>    - Standard metrics and observability hooks
>
> ------------------------------
> Design Overview
>
> The high-level architecture would look like:
>
> DataStream / Table API
>         ↓
> Inference Operator Layer
>         ↓
> Inference Execution Engine
>         ↓
> Pluggable Inference Backend
>
> This layer would integrate with Flink’s existing streaming runtime and
> remain fully compatible with current SQL/Table APIs.
> ------------------------------
> Non-goals
>
>    - This does NOT replace ML_PREDICT or existing SQL semantics
>    - This does NOT introduce a new ML training framework
>    - This is not tied to any specific inference engine
>
> ------------------------------
> Why now
>
> We see increasing adoption of Flink for real-time AI workloads, including:
>
>    - streaming inference
>    - LLM-based pipelines
>    - hybrid AI + data processing workflows
>
> However, the lack of a standardized runtime abstraction makes production
> deployments complex and inconsistent.
> ------------------------------
> Request for feedback
>
> I would like feedback on:
>
>    1. Whether a dedicated inference runtime layer fits within Flink’s
>    architectural direction
>    2. Preferred integration approach (Table API, DataStream, or both)
>    3. Scope of built-in features vs user-defined extensibility
>    4. Any existing efforts or ongoing work in this direction
>
> If there is agreement on direction, I will follow up with a more detailed
> FLIP design document.
> ------------------------------
>
> Thanks,
> featzhang
>

Reply via email to