Re: Re: [DISCUSS] FLIP-577: AI-Native Flink — An Umbrella Proposal for Multimodal Data Processing

Lincoln Lee Sun, 10 May 2026 23:47:36 -0700

Hi Guowei and all,

Big +1 on the overall direction of FLIP-577.


>From the SQL ecosystem perspective, over the past year the community
has introduced AI functions (e.g., ML_PREDICT, VECTOR_SEARCH) to help
users leverage Flink in AI scenarios. Through community interactions
and enterprise user feedback since then, we can strongly sense the
growing expectation for Flink to go further — broader multimodal data
processing capabilities, improved runtime stability under AI
workloads, and more optimized execution for inference-heavy pipelines.

This FLIP addresses exactly those gaps. In particular, I'm excited about:

 - Multimodal data type extensions — first-class types like
  Tensor, Image, Embedding, and Audio in the type system will unlock
  native SQL/Table expressions over multimodal data, rather than
  forcing users to work around opaque BYTES columns.
 - Object reference mechanism — a critical optimization for
  large-payload multimodal data. Passing multi-megabyte blobs through
  the standard data path is a known pain point today; a
  reference-based approach will significantly reduce serialization
  overhead and memory pressure.
 - Richer built-in AI functions — standardizing common multimodal
  operations (embedding generation, similarity search, model
  inference) as engine-level functions will eliminate the repeated
  ad-hoc UDF implementations we see across organizations today, and
  open the door for engine-level optimizations like shared model
  handles and adaptive batching.

As Jark mentioned, we'd like to participate in and contribute to the
relevant sub-FLIPs.


Best,
Lincoln Lee


Guowei Ma <[email protected]> 于2026年5月10日周日 21:26写道：

> Hi all,
>
> Thanks again for the thoughtful feedback and the valuable perspectives
> shared in this discussion.
>
> I have updated FLIP-577 [1] based on the discussion in this thread. The
> overall direction remains the same, but I have tried to make the scope,
> motivation, and boundaries clearer.
>
> The main changes are:
>
>    1.
>
>    Clarified the target workload as AI-oriented data processing workloads,
>    instead of relying only on the broader "AI-Native" wording.
>    2.
>
>    Added explicit non-goals to make clear that this proposal does not aim
>    to turn Flink into an AI framework, ML platform, or model serving
> system.
>    3.
>
>    Added "Why Now" and "Why Flink" sections to better explain the
>    production signals, ecosystem trends, and why Flink's existing runtime
>    strengths are relevant here.
>    4.
>
>    Reworked the umbrella rationale. The key point is not that every
>    sub-FLIP is AI-specific, but that these mechanisms need a shared runtime
>    contract across data representation, service invocation, GPU resources,
>    scaling, and recovery.
>    5.
>
>    Clarified that engine-level primitives should consider SQL/Table, Java
>    DataStream, and Python DataFrame.
>    6.
>
>    Made the initial correctness scope of runtime mechanisms —
>    non-disruptive scaling, UAC enhancements, and Pipeline Region
> checkpoints —
>    more conservative, with explicit opt-in where default behavior is
> affected.
>
> I also tried to reflect the earlier questions raised in this thread in the
> corresponding sections of the updated document.
>
> Looking forward to the continued discussion.
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
>
> Best,
> Guowei
>
>
> On Sun, May 10, 2026 at 4:19 PM Xintong Song <[email protected]>
> wrote:
>
> > Hi all,
> >
> > First, big thanks to Guowei for kicking off this important discussion,
> and
> > to the community for the substantive engagement over the past days. I'd
> > also like to share some of my own thoughts.
> >
> > My apologies for joining late — I've been tied up with other things and
> > only recently started catching up. Given the length of this thread, it's
> > possible I haven't fully digested every detail. If I miss anything,
> please
> > bear with me and feel free to point it out.
> >
> > Let me state my position upfront: a big +1 to this FLIP. Based on the
> > workload evolution I've been observing, I believe moving Flink toward
> > AI-Native is necessary — this is a project- and community-level effort
> that
> > matters deeply for how Flink responds to the industry shift driven by AI,
> > and how it captures the new opportunities that come with it. We may still
> > need more discussion to align on specific technical details, but at the
> > strategic level of where the community should head, I strongly support
> this
> > umbrella.
> >
> > Below are my thoughts on a few key questions raised in the discussion so
> > far.
> >
> >
> > 1. On the scope of the umbrella
> >
> > Robert, Yaroslav, and Martijn have all pointed out that some items in the
> > FLIP — Pipeline Region independent checkpointing, UAC enhancements,
> > non-disruptive scaling, etc. — are not unique to AI scenarios and would
> > also apply elsewhere. That observation is fair.
> >
> > However, I'd argue the right criterion for whether something belongs
> under
> > this umbrella is not whether it is *exclusively* used for AI, but whether
> > it is *essential* to the central goal here — AI-Native Flink — i.e.,
> > whether AI-Native Flink would actually fall short without it.
> >
> > From that angle, I agree with what Guowei has already laid out: under AI
> > workload characteristics (per-record processing cost orders of magnitude
> > higher, GPU resources expensive and scarce, long and long-tailed
> > asynchronous inference), these otherwise-general mechanisms get pushed to
> > the limit of their current implementations. They may be adequate for
> > traditional BI/ETL workloads, but they become real bottlenecks under AI
> > workloads. Without them, the AI-Native Flink story would be incomplete.
> So
> > I support including them under this umbrella.
> >
> >
> > 2. On usage and maintenance complexity
> >
> > I understand the concerns Yaroslav and Martijn raised about complexity
> > costs. That said, I don't think we should conservatively reject a
> > capability because of complexity, if that capability is critical and
> brings
> > substantial benefits. On the necessity of RpcOperator specifically, Zhu
> > Zhu, Guowei, and others have already provided answers earlier in this
> > thread, and I'll add some context from the Flink Agents side below.
> >
> > As a side observation: with the rapid progress of AI-assisted development
> > over the past year or two, the way engineers work is shifting in a
> > noticeable way. On one hand, AI coding tools are helping developers
> > maintain complex software systems at lower effort. On the other hand,
> more
> > and more users are starting to ask AI how to use Flink correctly, and
> even
> > letting AI help them develop and operate Flink workloads. Usage and
> > maintenance complexity still matter, but their relative weight is
> arguably
> > going down — this is a trend worth factoring into long-term tradeoffs at
> > the community level.
> >
> >
> > 3. On "wait for the industry to stabilize before integrating"
> >
> > Yaroslav raised a concern about the timing — the LLM tooling landscape is
> > changing fast, and it's hard to predict what will be needed tomorrow. I'd
> > offer a different perspective.
> >
> > It's *precisely because* the industry is evolving fast and hasn't yet
> > settled into a stable shape that Flink should join earlier — to influence
> > and help define how that shape forms. If we just stand by waiting for the
> > landscape to settle, by the time it does, there may not be a place for
> > Flink in it, and we'd have missed an important window.
> >
> > In fact, projects like Daft and Ray Data are increasingly defining the de
> > facto standards in this space, as Leonard pointed out earlier in this
> > thread. If Flink doesn't engage proactively, even keeping up later will
> put
> > us in a reactive position.
> >
> > Engaging early does carry some cost of trial and error — some
> capabilities
> > may need to be deprecated or restructured down the line. But I'd argue
> this
> > is a normal part of how open-source projects evolve, and it's both
> > necessary and worth it.
> >
> >
> > 4. RpcOperator from a Flink Agents perspective
> >
> > The benefits of RpcOperator have already been articulated by several
> people
> > earlier in the thread. I'd like to add a concrete scenario from the Flink
> > Agents side — the subagent pattern.
> >
> >
> > 4.1 Brief intro to the subagent pattern
> >
> > In agent practice, after planning, the main agent often spawns one or
> more
> > subagents to handle specific subtasks. Each subagent has its own LLM
> > context — its own system prompt, its own context window, its own toolset
> > and permissions. The main agent only receives the final result from the
> > subagent, not the intermediate steps. This pattern is now widely adopted
> in
> > the industry, and many agent skills explicitly say "spawn a subagent to
> do
> > X".
> >
> > Flink Agents wants to support this pattern for several reasons:
> > compatibility with the existing agent skill ecosystem; context isolation,
> > so that intermediate state during subagent execution doesn't pollute the
> > main agent's context; and one additional advantage that enterprise
> > production scenarios have over single-machine personal-assistant agents —
> > the ability to dispatch resource-intensive subtasks to a shared resource
> > pool for efficient resource reuse.
> >
> >
> > 4.2 Why this needs RpcOperator instead of in-place execution inside the
> > Flink Agents operator
> >
> > Why not just spawn subagents directly inside the Flink Agents operator?
> > Because subagent workloads are highly bursty — within the same job, one
> > minute's input might need several subagents for deep processing; the next
> > minute's input might need none. Under that kind of randomness, statically
> > allocating subagent resources per operator parallelism leaves no good
> > answer: under-allocate and the operator gets blown out the moment demand
> > arrives; over-allocate and resources sit idle most of the time.
> >
> > The natural answer is to put subagent execution into a shared resource
> pool
> > that can be used by multiple upstream operator subtasks, and that can
> scale
> > elastically based on load. This maps directly onto the design and core
> > benefits of RpcOperator that Zhu Zhu laid out earlier — each RpcOperator
> > instance forming its own Pipelined Region, which in turn enables
> > independent scaling and flexible load balancing.
> >
> >
> > 4.3 Why this shared resource pool should be inside Flink rather than
> > deployed externally
> >
> > Two reasons.
> >
> > First, the prompts, toolsets, and permission configurations a subagent
> > needs at execution time are essentially part of the Flink Agents *job
> > definition*. Stripping them out of the job and deploying them separately
> > effectively splits the job definition in two, leaving users to maintain a
> > synchronization relationship between an external deployment and the Flink
> > job. This goes against the developer experience Flink Agents aims to
> > provide.
> >
> > Second, subagent execution needs to be incorporated into Flink's
> checkpoint
> > mechanism. The notion of "stateful" here deserves a brief clarification:
> > each subagent task is self-contained — the full context it needs is
> handed
> > over by the main agent at dispatch time in one shot. That is, subagents
> > don't need to share state across records, so at the task level they look
> > stateless. However, executing a single subagent task takes a significant
> > amount of time and involves multiple rounds of model calls and tool
> calls,
> > where tool calls may have externally observable side effects (sending
> > emails, writing to external systems, etc.). This means we need failure
> > recovery for the in-flight computation state during task execution, so
> that
> > subagent execution preserves exactly-once semantics across failovers.
> >
> >
> > To wrap up, let me reiterate my support for the overall direction of this
> > FLIP. There are clearly technical details still to be worked out, but I
> > believe the direction described by this umbrella is both necessary and
> > important for the long-term evolution of the Flink community. Looking
> > forward to continued discussion in the sub-FLIPs ahead.
> >
> > Thanks!
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Fri, May 8, 2026 at 3:43 PM Guowei Ma <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Thanks everyone for the in-depth discussion over the past few days. Let
> > me
> > > first summarize my understanding of the discussion so far. I see strong
> > > interest in making Flink better support AI-oriented data processing
> > > workloads, especially multimodal and inference-oriented pipelines. At
> the
> > > same time, a recurring concern is that RpcOperator and some of the
> Layer
> > 3
> > > runtime improvements also look like general Flink capabilities, so why
> > > should they be discussed together under the AI-Native / multimodal
> > > processing umbrella?
> > >
> > > I think the key question is not whether these mechanisms can only be
> used
> > > for AI scenarios, but whether they jointly form the runtime contract
> > Flink
> > > needs to support AI-oriented data processing workloads, especially
> > > multimodal inference pipelines. Capabilities such as RpcOperator,
> > > checkpointing, scaling, and resource management can certainly serve
> > broader
> > > use cases. However, under this class of workloads, they require more
> > > consistent runtime semantics and coordinated design, and together
> > determine
> > > whether the system can provide production-grade execution, recovery,
> and
> > > elasticity.
> > >
> > > Compared with traditional streaming workloads, this class of workloads
> > > changes the data shape, computation pattern, and resource model
> > > significantly. Data is no longer only small structured records; it may
> be
> > > images, video, audio, tensors, embeddings, or object references to
> > external
> > > large objects. Many pipelines also have a relatively shuffle-light
> shape,
> > > such as URI → preprocessing → inference → sink. The computation logic
> > often
> > > includes model inference or service-style invocation, either as remote
> > > inference service invocation or local GPU / accelerator-backed
> execution.
> > > Therefore, the system needs to handle not only ordinary RPC calls, but
> > also
> > > service discovery, backpressure propagation, batching / concurrency
> > > control, timeout / retry, in-flight request draining, model loading,
> GPU
> > > warmup, resource scheduling, and fault recovery.
> > >
> > > From this perspective, RpcOperator is not just “another way to call an
> > > external service,” nor is it merely a deployment mechanism for GPU
> > > operators. More importantly, it defines a service-style operator
> > > abstraction: when inference becomes part of the logical data flow,
> Flink
> > > needs to understand and coordinate these runtime semantics, rather than
> > > hiding inference completely behind external black-box calls inside user
> > > code.
> > >
> > > Some of the Layer 3 runtime improvements follow the same logic. While
> > > mechanisms such as checkpointing or scaling are not exclusive to AI
> > > workloads, inference-oriented workloads fundamentally change their
> > > operational assumptions and cost model, making runtime behavior far
> more
> > > critical than in traditional data processing systems. When per-record
> > > computation is expensive, GPU warmup and model loading are costly, and
> > the
> > > execution environment may involve elastic / preemptible resources, the
> > cost
> > > of global rollback or disruptive scaling becomes much higher than in
> > > traditional row-at-a-time BI / ETL workloads. As a result, runtime
> > behavior
> > > that may have been acceptable for traditional workloads can directly
> > affect
> > > stability and resource efficiency in inference-oriented workloads.
> > >
> > > At the same time, the umbrella proposal helps provide a shared context
> > for
> > > discussing how these capabilities relate to each other and what common
> > > runtime assumptions they rely on. The more important value of the
> > umbrella
> > > is to align on the workload model, design principles, boundaries, and
> > > dependencies between capabilities, so that independently evolving
> pieces
> > > such as RpcOperator, GPU resource declaration, batching / concurrency
> > > control, non-disruptive scaling, and regional checkpointing do not end
> up
> > > with inconsistent runtime semantics.
> > >
> > > Based on this discussion, I will update the proposal to make the
> workload
> > > model, RpcOperator boundary, and Layer 3 dependency relationship
> clearer.
> > >
> > > Best,
> > > Guowei
> > > Best,
> > > Guowei
> > >
> > >
> > > On Fri, May 8, 2026 at 12:37 PM zhangjiaogg <[email protected]>
> wrote:
> > >
> > > > Hi Guowei and all,
> > > >
> > > > Thank you for driving this initiative. Strong +1 on the overall
> > > direction.
> > > >
> > > > From our perspective, the core value of this proposal lies in two
> > areas:
> > > > extending Flink's intelligent processing capabilities for multimodal
> > > data,
> > > > and enabling native, in-pipeline local inference. As AI capabilities
> > > > continue to advance, multimodal data is accounting for an
> increasingly
> > > > large share of overall data volume, and the ability to perform
> > > intelligent,
> > > > real-time processing on this data — not just ingestion or routing,
> but
> > > > actual inference and transformation within the stream — is becoming a
> > > > critical requirement across industries. Today, most pipelines treat
> > > > multimodal objects as opaque blobs and push inference to external
> > > systems,
> > > > which works but at the cost of complexity, latency, and consistency.
> > > >
> > > > This is exactly the pain one of our customers experiences in their
> > > > autonomous driving data pipelines. Video and image data captured by
> > > onboard
> > > > cameras must go through annotation, frame extraction, quality
> > filtering,
> > > > and both unstructured and structured data transformation before it
> can
> > be
> > > > used for model training — a workflow that today requires combining
> > > multiple
> > > > specialized systems: a stream engine for structured processing, a
> > > separate
> > > > framework (e.g., Ray Data) for multimodal processing, and an external
> > > model
> > > > serving layer for inference. Each system boundary introduces
> > intermediate
> > > > storage, operational overhead, and data consistency challenges across
> > the
> > > > full pipeline. If Flink can handle this end-to-end — with native
> > > multimodal
> > > > types, local GPU inference operators, and unified checkpointing — the
> > > > entire workflow becomes a single Flink job, intermediate storage is
> > > > eliminated, and fault recovery covers the pipeline as a whole.
> > > >
> > > > We look forward to the sub-FLIP discussions and would be happy to
> > > > contribute.
> > > >
> > > > Best regards
> > > > Jiao Zhang
> > > >
> > > > At 2026-05-07 12:58:15, "zihao chen" <[email protected]> wrote:
> > > > >Hi, all,
> > > > >
> > > > >I’d like to share some thoughts based on our internal experience
> > > > >with AI workloads on Flink.
> > > > >
> > > > >At Tencent, we have production scenarios where Flink is used in
> > > > >AI-related pipelines.
> > > > >
> > > > >Based on these workloads, we explored elasticity and autoscaling
> > > > >for cloud-native stream processing systems and published our
> > > > >experience in SIGMOD 2025:
> > > > >
> > > > >"Oceanus: Enable SLO-Aware Vertical Autoscaling for
> > > > >Cloud-Native Streaming Services in Tencent" [1]
> > > > >
> > > > >As our workloads evolved, we also started to see increasing
> > > > >GPU-based training and inference scenarios.
> > > > >
> > > > >Our current solution integrates Flink with external GPU services.
> > > > >While this works functionally, it also introduces several practical
> > > > >issues, such as:
> > > > >
> > > > >   -
> > > > >
> > > > >   fragmented lifecycle management
> > > > >   -
> > > > >
> > > > >   operational complexity
> > > > >   -
> > > > >
> > > > >   inconsistent scaling/recovery behavior across systems
> > > > >
> > > > >From this perspective, I think FLIP-577 is exploring a very
> > > > >meaningful direction.
> > > > >
> > > > >In particular, I agree with the idea of integrating GPU-backed
> > > > >computation more naturally into Flink’s runtime model, instead of
> > > > >treating it purely as an external service integration problem.
> > > > >
> > > > >Besides, from the elasticity perspective, our experience is that
> > > > >GPU workloads have very different characteristics compared with
> > > > >traditional CPU workloads:
> > > > >
> > > > >   -
> > > > >
> > > > >   GPU resources are expensive and scarce
> > > > >   -
> > > > >
> > > > >   Startup and replay costs are significantly higher
> > > > >   -
> > > > >
> > > > >   Long-running tasks make scaling and recovery more challenging
> > > > >
> > > > >In our experience, GPU elasticity cannot simply reuse the
> > > > >assumptions behind traditional CPU elasticity.
> > > > >
> > > > >Because of this, elasticity becomes especially important for
> > > > >production AI workloads, not only for resource efficiency, but also
> > > > >for reducing scaling and recovery overhead.
> > > > >
> > > > >More broadly, AI workloads increasingly require Flink to collaborate
> > > > >more naturally with GPU-backed computation, and I believe
> > > > >FLIP-577 is exploring an important direction toward addressing
> > > > >this gap.
> > > > >
> > > > >Overall, I’m looking forward to further discussions about this FLIP.
> > > > >
> > > > >[1] https://dl.acm.org/doi/abs/10.1145/3722212.3724445
> > > > >
> > > > >
> > > > >Best,
> > > > >Zihao Chen
> > > > >
> > > > >Yong Fang <[email protected]> 于2026年5月7日周四 11:11写道：
> > > > >
> > > > >> Hi devs,
> > > > >>
> > > > >> Thanks Guowei for initiating this proposal. I think this is an
> > > important
> > > > >> step for Flink towards the era of AI data processing, very big +1.
> > > > >>
> > > > >> I’d like to share some scenarios and requirements of leveraging
> > > PyFlink
> > > > for
> > > > >> AI data processing at ByteDance. Currently, we run tens of
> thousands
> > > of
> > > > >> pyflink/flink jobs, using millions of CPU cores.
> > > > >>
> > > > >> 1) Multimodal Data Processing
> > > > >> We want to use PyFlink to generate multimodal feature data. The
> > > typical
> > > > >> workflow starts by reading ID-based raw data and performing large
> > > table
> > > > >> joins and ETL computations. We then fetch multimodal assets such
> as
> > > > images,
> > > > >> videos, texts and audios from object storage by ID. These
> multimodal
> > > > data
> > > > >> are either sent to an RPC service (backed by local models or
> remote
> > > > large
> > > > >> models), or processed via local GPU computing for frame
> extraction,
> > > > >> embedding generation and other tasks. After multimodal
> computation,
> > > > output
> > > > >> results including embeddings, processed images and multimodal
> > metadata
> > > > are
> > > > >> generated and persisted into the downstream multimodal data lake.
> > > > >>
> > > > >> 2) Stream-Batch Unified Data Training
> > > > >> We use PyFlink to consume processed sample data from MQ or data
> > lakes.
> > > > >> Within the data pipeline, data may be shuffled by key, then fed
> into
> > > > >> parameter servers or local services for CPU-based or GPU-based
> model
> > > > >> training. Such workloads strongly demand optimized CPU & GPU
> hybrid
> > > > >> scheduling, worker node restart capability, fast scaling, as well
> as
> > > > native
> > > > >> support for unified streaming and batch training.
> > > > >>
> > > > >> While supporting the above two categories of workloads, we have
> > > several
> > > > >> common requirements for PyFlink and Flink Core:
> > > > >>
> > > > >> 1) Native Python Computing Capability
> > > > >> We need more user-friendly DataFrame APIs, comprehensive built-in
> > data
> > > > >> types for image and audio, as well as richer multimodal computing
> > > > >> operators. This enables users to develop multimodal data
> processing
> > > jobs
> > > > >> more efficiently, and allows better optimization and scheduling at
> > the
> > > > >> execution plan layer.
> > > > >>
> > > > >> 2) CPU & GPU Scheduling and Resource Management
> > > > >> It is expected to tag resource requirements at fine-grained levels
> > > such
> > > > as
> > > > >> python user-defined functions. The scheduler should provide
> enhanced
> > > > >> resource orchestration and task scheduling, enabling more flexible
> > > > >> heterogeneous resource management.
> > > > >>
> > > > >> 3) Embedded Server Node Capability in Pipeline
> > > > >> We hope to launch dedicated server nodes inside a single pipeline,
> > to
> > > > load
> > > > >> local models or access remote models, which may be shared between
> > > > different
> > > > >> operators in the pipeline. This unifies data ETL and multimodal
> > > > computing
> > > > >> within one end-to-end pipeline, greatly simplifying operation and
> > > > >> maintenance for business teams.
> > > > >>
> > > > >> 4) Performance and Stability Optimization
> > > > >> Key enhancements include zero-copy data transfer between Flink TM
> > > > processes
> > > > >> and Python processes, fast scaling of compute nodes, fast
> > > checkpointing
> > > > >> mechanism, and shuffle optimization for large-scale datasets.
> These
> > > > >> improvements will significantly boost the performance and
> stability
> > of
> > > > >> PyFlink multimodal workloads.
> > > > >>
> > > > >> I’m really excited to see that FLIP-577 has covered all the real
> > > > production
> > > > >> scenarios and requirements mentioned above. It's a good chance to
> > > > iterate
> > > > >> and enrich the core capabilities of PyFlink and Flink Core
> targeting
> > > > these
> > > > >> AI data processing scenarios, and build Flink into a first-class
> AI
> > > data
> > > > >> processing engine.
> > > > >>
> > > > >> I’m very much looking forward to the progress.
> > > > >>
> > > > >> Best,
> > > > >> FangYong
> > > > >>
> > > > >> On Wed, May 6, 2026 at 8:36 PM FeatZhang <[email protected]>
> > > wrote:
> > > > >>
> > > > >> > Hi all,
> > > > >> >
> > > > >> > This is a great topic, and honestly long overdue.
> > > > >> >
> > > > >> > With the rapid growth of AI applications, we have been seeing a
> > > > >> > significant increase in real-world demands from users who are
> > > already
> > > > >> > building on Flink and other traditional data processing or BI
> > > engines.
> > > > >> > From a platform perspective, more and more teams are trying to
> > > > >> > integrate AI capabilities directly into their existing streaming
> > > > >> > pipelines, rather than treating them as separate systems.
> > > > >> >
> > > > >> > This is not an isolated trend — it is becoming a common
> > requirement
> > > > >> > across industries.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 1. What is changing in real systems
> > > > >> >
> > > > >> > We are observing a consistent shift:
> > > > >> >
> > > > >> > AI is moving from offline analysis or request-time scoring to
> > > > >> > continuous, event-driven decision making.
> > > > >> >
> > > > >> > In other words:
> > > > >> >
> > > > >> > AI is becoming part of the data stream itself.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 2. Representative production scenarios
> > > > >> >
> > > > >> > 2.1 Real-time fraud detection (per-event decision under strict
> > > > latency)
> > > > >> >
> > > > >> > Typical setup
> > > > >> >
> > > > >> > Continuous transaction stream (payments, logins, transfers)
> > > > >> > Each event must be evaluated within milliseconds
> > > > >> > Decision depends on:
> > > > >> >
> > > > >> > recent user behavior
> > > > >> > device / IP patterns
> > > > >> > short-term aggregates
> > > > >> >
> > > > >> > What is happening in practice
> > > > >> >
> > > > >> > Models are already deeply integrated into decision flow
> > > > >> > Feature freshness directly impacts detection accuracy
> > > > >> >
> > > > >> > Current pain
> > > > >> >
> > > > >> > Features computed in streaming, but inference is remote
> > > > >> > Network overhead adds to critical path latency
> > > > >> > Hard to ensure training/serving consistency
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 2.2 Real-time recommendation and ads (continuous re-ranking)
> > > > >> >
> > > > >> > Typical setup
> > > > >> >
> > > > >> > User interaction stream (click, view, dwell time)
> > > > >> > Continuous feature updates (session + short-term behavior)
> > > > >> > Inference triggered per interaction
> > > > >> >
> > > > >> > What is happening in practice
> > > > >> >
> > > > >> > Increasing reliance on real-time context
> > > > >> > Model-based ranking becomes core logic
> > > > >> >
> > > > >> > Current pain
> > > > >> >
> > > > >> > Offline and online feature pipelines diverge
> > > > >> > Training-serving skew is common
> > > > >> > Inference orchestration is ad-hoc
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 2.3 Streaming RAG / knowledge systems (continuous indexing)
> > > > >> >
> > > > >> > Typical setup
> > > > >> >
> > > > >> > Continuous ingestion of documents, logs, or knowledge
> > > > >> > Pipeline:
> > > > >> >
> > > > >> > chunking → embedding → indexing → retrieval
> > > > >> >
> > > > >> > Typical use cases
> > > > >> >
> > > > >> > AI copilots
> > > > >> > enterprise knowledge assistants
> > > > >> > observability systems
> > > > >> >
> > > > >> > Current pain
> > > > >> >
> > > > >> > Built via loosely coupled services or scripts
> > > > >> > No strong consistency guarantees
> > > > >> > Difficult to scale and recover
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 2.4 Real-time feedback loop (continuous evaluation)
> > > > >> >
> > > > >> > Typical setup
> > > > >> >
> > > > >> > Prediction at time T
> > > > >> > Label arrives at T + Δ
> > > > >> >
> > > > >> > Required processing
> > > > >> >
> > > > >> > prediction stream JOIN label stream → metrics → optimization
> > > > >> >
> > > > >> > Current pain
> > > > >> >
> > > > >> > Alignment logic duplicated across systems
> > > > >> > Late data handling is complex
> > > > >> > No reusable evaluation abstraction
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 2.5 AI replacing rule-based decision logic
> > > > >> >
> > > > >> > Evolution
> > > > >> >
> > > > >> > From:
> > > > >> >
> > > > >> > rule engine / CEP
> > > > >> >
> > > > >> > To:
> > > > >> >
> > > > >> > model / LLM → decision
> > > > >> >
> > > > >> > Implication
> > > > >> >
> > > > >> > AI is becoming the core decision layer inside streaming systems.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 3. Architectural shift
> > > > >> >
> > > > >> > Across all scenarios:
> > > > >> >
> > > > >> > From:
> > > > >> >
> > > > >> > data processing → feature system → model serving → evaluation
> > > > >> >
> > > > >> > To:
> > > > >> >
> > > > >> > stream = feature + inference + decision + feedback
> > > > >> >
> > > > >> > This reflects a fundamental change in system boundaries.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 4. Why externalized architectures break down
> > > > >> >
> > > > >> > Most current implementations rely on multiple systems:
> > > > >> >
> > > > >> > stream processing
> > > > >> > feature store
> > > > >> > model serving
> > > > >> > vector database
> > > > >> >
> > > > >> > This introduces several fundamental issues in real-time
> scenarios.
> > > > >> >
> > > > >> > 4.1 Latency dominated by system boundaries
> > > > >> >
> > > > >> > stream → network → model service → response
> > > > >> >
> > > > >> > network overhead is unavoidable
> > > > >> > batching is not controlled by the stream runtime
> > > > >> > no end-to-end backpressure
> > > > >> >
> > > > >> > Latency becomes a system-level artifact rather than a compute
> > > > property.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 4.2 Inconsistent data between training and serving
> > > > >> >
> > > > >> > offline vs online features
> > > > >> > different definitions or time windows
> > > > >> >
> > > > >> > Models operate on inconsistent data distributions.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 4.3 State fragmentation
> > > > >> >
> > > > >> > user/session context must be rebuilt or fetched
> > > > >> > loss of data locality
> > > > >> > processing becomes call-driven
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 4.4 Feedback loop is not composable
> > > > >> >
> > > > >> > difficult alignment of prediction and label streams
> > > > >> > no unified handling of late data
> > > > >> > duplicated evaluation logic
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 4.5 Operational complexity
> > > > >> >
> > > > >> > multiple systems to scale
> > > > >> > multiple failure domains
> > > > >> > complex debugging paths
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 5. Why this aligns with Flink
> > > > >> >
> > > > >> > These workloads require:
> > > > >> >
> > > > >> > event-driven execution
> > > > >> > strong state management
> > > > >> > precise time semantics
> > > > >> > continuous feedback
> > > > >> >
> > > > >> > These are exactly Flink’s core strengths.
> > > > >> >
> > > > >> > The key insight is:
> > > > >> >
> > > > >> > Inference should be modeled as a dataflow operator, not an
> > external
> > > > >> > service.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 6. Implication
> > > > >> >
> > > > >> > If we model:
> > > > >> >
> > > > >> > data → feature → inference → decision → feedback
> > > > >> >
> > > > >> > within Flink, we can achieve:
> > > > >> >
> > > > >> > unified scheduling
> > > > >> > shared state
> > > > >> > consistent time semantics
> > > > >> > end-to-end fault tolerance
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > 7. Conclusion
> > > > >> >
> > > > >> > This is not simply about adding AI support to Flink.
> > > > >> >
> > > > >> > It is about recognizing that:
> > > > >> >
> > > > >> > Real-time AI systems are fundamentally streaming systems.
> > > > >> >
> > > > >> > The question is whether Flink evolves to support this natively,
> or
> > > > >> > remains a preprocessing layer in front of external AI stacks.
> > > > >> >
> > > > >> > ________________________________
> > > > >> >
> > > > >> > Happy to follow up with a more concrete proposal (e.g.,
> inference
> > > > >> > operator abstraction) if there is interest.
> > > > >> >
> > > > >> > Thanks.
> > > > >> >
> > > > >> > On Mon, May 4, 2026 at 10:31 PM Gen Luo <[email protected]>
> > > wrote:
> > > > >> > >
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > Thank you Guowei Ma for driving this discussion, and thanks
> > > everyone
> > > > >> for
> > > > >> > > the valuable insights. Inspired by this exchange, I’d like to
> > > share
> > > > a
> > > > >> few
> > > > >> > > thoughts.
> > > > >> > >
> > > > >> > > While “AI-Native” covers broad ground, I believe this FLIP
> does
> > > not
> > > > >> > > overextend Flink’s scope. It’s a necessary iteration driven by
> > > > evolving
> > > > >> > > user scenarios and AI advancements, particularly multimodal
> > > > processing.
> > > > >> > > Given the growing adoption of multimodal applications and
> > > increasing
> > > > >> > > interest in low-latency inference, initiating these
> enhancements
> > > is
> > > > a
> > > > >> > > timely step to better align Flink with evolving AI workloads.
> > > > >> > >
> > > > >> > > From our engagements with customers and developers, we
> observe a
> > > > clear
> > > > >> > > shift in both workloads and user expectations. Model inference
> > is
> > > > >> > > increasingly central to data pipelines, with multimodal AI
> tasks
> > > > >> growing
> > > > >> > > rapidly. Traditional real-time scenarios (e.g., monitoring and
> > > > >> analytics)
> > > > >> > > now leverage models and agent frameworks like Flink Agent for
> > > > >> > intelligent,
> > > > >> > > multi-turn decision-making, while large-scale offline compute
> is
> > > > also
> > > > >> > > shifting toward LLMs and vision models. Alongside this
> workload
> > > > >> > evolution,
> > > > >> > > developer workflows have adapted: AI practitioners naturally
> > > prefer
> > > > >> > Python
> > > > >> > > and DataFrame-style APIs. As AI-assisted coding matures,
> > aligning
> > > > >> system
> > > > >> > > interfaces with these familiar patterns will directly improve
> > > > >> > AI-generated
> > > > >> > > code quality and significantly lower adoption barriers for the
> > AI
> > > > >> > community.
> > > > >> > >
> > > > >> > > Today, many AI evaluation tools don’t yet recommend Flink for
> AI
> > > > >> > > workloads—largely due to limited visibility of Flink’s
> relevant
> > > > >> > > capabilities rather than fundamental incompatibility. In
> > reality,
> > > > Flink
> > > > >> > has
> > > > >> > > unique strengths here. For example, generating multimodal
> > samples
> > > is
> > > > >> > often
> > > > >> > > a multi-day, GPU-heavy process. Flink’s streaming model,
> > combined
> > > > with
> > > > >> > > checkpointing and reduced disk I/O, is well-suited for such
> > > > >> long-running
> > > > >> > > tasks—a direction also pursued by engines like Daft and Ray
> > Data.
> > > > With
> > > > >> > > Flink’s proven production stability, we’re well-positioned for
> > > both
> > > > >> batch
> > > > >> > > and future real-time multimodal streaming inference. Targeted
> > > > >> > improvements
> > > > >> > > can make these advantages visible, driving better user
> > experiences
> > > > and
> > > > >> > > healthier ecosystem growth.
> > > > >> > >
> > > > >> > > I’d also note a lesson from FlinkML. It attempted to cover
> model
> > > > >> training
> > > > >> > > but struggled to align with the fast-iteration,
> > > > Python/notebook-centric
> > > > >> > > workflows preferred by ML researchers. Flink’s core strength
> > lies
> > > in
> > > > >> > > high-concurrency, production-grade inference orchestration—not
> > > > training
> > > > >> > > lifecycle management (e.g., experiment tracking, versioning).
> > This
> > > > >> > mismatch
> > > > >> > > limited its adoption.
> > > > >> > >
> > > > >> > > This proposal, however, takes a different path. It doesn’t aim
> > to
> > > > >> replace
> > > > >> > > training frameworks. Instead, it introduces modern AI concepts
> > > > >> > (multimodal
> > > > >> > > data, LLMs) as first-class citizens for inference, built atop
> > > > Flink’s
> > > > >> > > computation strengths. Think Ray Data’s scope (plus simple
> > > > co-located
> > > > >> > > serving), not Train/Tune. Crucially, unlike the FlinkML era,
> > > today’s
> > > > >> > models
> > > > >> > > use standardized interfaces and mature serving frameworks,
> > > allowing
> > > > >> Flink
> > > > >> > > to integrate external models seamlessly without heavy
> > > > >> > > customization—significantly lowering project risk.
> > > > >> > >
> > > > >> > > This FLIP marks Flink’s another starting point for the AI era.
> > > While
> > > > >> > > details need refinement, I believe this direction aligns with
> > both
> > > > >> > current
> > > > >> > > and future user needs and Flink’s evolution.
> > > > >> > >
> > > > >> > > Best,
> > > > >> > > Gen
> > > > >> > >
> > > > >> > > On Mon, May 4, 2026 at 12:15 PM Jark Wu <[email protected]>
> > wrote:
> > > > >> > >
> > > > >> > > > Hi Guowei,
> > > > >> > > >
> > > > >> > > > Thanks for driving this. +1 on the overall direction.
> Flink's
> > > > >> > > > streaming processing and checkpoint mechanism give it a
> > > structural
> > > > >> > > > advantage over systems like Daft and Ray. But today, these
> > > runtime
> > > > >> > > > strengths are held back by gaps in Python API, GPU
> scheduling,
> > > and
> > > > >> > > > native multimodal data handling. This umbrella FLIP
> addresses
> > > > exactly
> > > > >> > > > that gap, comprehensively and systematically. I believe
> > > multimodal
> > > > >> > > > data processing is the biggest opportunity for traditional
> > data
> > > > infra
> > > > >> > > > to transition into AI infra, and this is one of the most
> > > important
> > > > >> > > > FLIPs for Flink in the AI era.
> > > > >> > > >
> > > > >> > > > As one of the Table/SQL module maintainers, we would like to
> > > > >> > > > contribute the built-in multimodal processing UDFs (audio,
> > > video,
> > > > >> > > > image, text) and native multimodal data types (Tensor,
> Image,
> > > > >> > > > Embedding, etc.) as first-class citizens in the type system.
> > > > Looking
> > > > >> > > > forward to the sub-FLIP discussions.
> > > > >> > > >
> > > > >> > > > Best,
> > > > >> > > > Jark
> > > > >> > > >
> > > > >> > > > On Thu, 30 Apr 2026 at 18:42, Guowei Ma <
> [email protected]
> > >
> > > > >> wrote:
> > > > >> > > > >
> > > > >> > > > > Hi,Yaroslav
> > > > >> > > > >
> > > > >> > > > > Thanks for taking the time to write this detailed
> feedback.
> > > Let
> > > > me
> > > > >> > > > clarify
> > > > >> > > > > the intent of the proposal first.
> > > > >> > > > >
> > > > >> > > > > I am not saying that Flink should become an AI framework,
> an
> > > ML
> > > > >> > platform,
> > > > >> > > > > or a model serving system. The way I use "AI-Native" in
> this
> > > > >> > proposal is
> > > > >> > > > to
> > > > >> > > > > say that Flink should support, as first-class citizens,
> the
> > > core
> > > > >> > objects
> > > > >> > > > > and execution patterns that frequently show up in
> > AI-oriented
> > > > data
> > > > >> > > > > processing — instead of leaving them entirely to external
> > > > systems
> > > > >> or
> > > > >> > ad
> > > > >> > > > hoc
> > > > >> > > > > user-defined integrations.
> > > > >> > > > >
> > > > >> > > > > These objects and execution patterns include:
> > > > >> > > > >
> > > > >> > > > >    - Multimodal and unstructured data objects such as
> > images,
> > > > >> video,
> > > > >> > > > audio,
> > > > >> > > > >    tensors, embeddings, and object references.
> > > > >> > > > >    - Model inference as part of the data flow, rather than
> > an
> > > > >> > entirely
> > > > >> > > > >    external black-box service call.
> > > > >> > > > >    - Operators backed by heterogeneous resources such as
> > GPUs.
> > > > >> > > > >    - Pythonic and vectorized processing styles.
> > > > >> > > > >    - Long-running, long-tailed asynchronous computation.
> > > > >> > > > >
> > > > >> > > > > "AI-Native" is just a shorthand here, meaning that Flink
> > > should
> > > > >> > natively
> > > > >> > > > > understand and support the core abstractions of this class
> > of
> > > > >> > workloads.
> > > > >> > > > > The FLIP needs to make the target workload class clearer.
> > What
> > > > we
> > > > >> > care
> > > > >> > > > > about is not any specific model paradigm — LLM, CV,
> > > > recommendation,
> > > > >> > or
> > > > >> > > > > traditional ML inference — but a class of data processing
> > > > workloads
> > > > >> > with
> > > > >> > > > > shared runtime and topology characteristics:
> > > > >> > > > >
> > > > >> > > > >    - A single computation may take seconds or even
> minutes,
> > > > instead
> > > > >> > of
> > > > >> > > > >    microseconds as in traditional row-at-a-time
> processing.
> > > > >> > > > >    - Execution often involves heterogeneous resources such
> > as
> > > > CPU +
> > > > >> > GPU,
> > > > >> > > > >    where GPUs are expensive and scarce.
> > > > >> > > > >    - Data is often multimodal large objects (images,
> video,
> > > > audio,
> > > > >> > > > tensors,
> > > > >> > > > >    embeddings), rather than structured small records.
> > > > >> > > > >    - Computation logic often includes model inference or
> > > > >> > service-style
> > > > >> > > > >    invocations as part of the pipeline.
> > > > >> > > > >    - Many target topologies are relatively shuffle-light
> and
> > > > don't
> > > > >> > > > >    necessarily involve complex keyed-state migration, e.g.
> > > URI →
> > > > >> > > > preprocessing
> > > > >> > > > >    → inference → sink.
> > > > >> > > > >
> > > > >> > > > > Ten years ago, many ML workloads took the form of offline
> > > > training
> > > > >> > plus
> > > > >> > > > > online feature serving. Flink already played a strong role
> > in
> > > > >> feature
> > > > >> > > > > engineering, streaming feature computation, and real-time
> > data
> > > > >> > > > preparation,
> > > > >> > > > > so there was no strong need to reshape Flink into an
> > > "ML-Native"
> > > > >> > engine.
> > > > >> > > > >
> > > > >> > > > > What is changing today is that model inference itself is
> > > > >> increasingly
> > > > >> > > > > becoming part of the data processing pipeline; multimodal
> > > > objects
> > > > >> > are no
> > > > >> > > > > longer just opaque blobs in external storage, but data
> > objects
> > > > that
> > > > >> > need
> > > > >> > > > to
> > > > >> > > > > be referenced, passed, transformed, inferred over, and
> > landed
> > > > >> inside
> > > > >> > the
> > > > >> > > > > engine. This is not simply one more ML use case — it is a
> > > > change in
> > > > >> > the
> > > > >> > > > > shape of workloads Flink needs to support.
> > > > >> > > > >
> > > > >> > > > > On whether the user demand is real, the validation signals
> > we
> > > > are
> > > > >> > > > currently
> > > > >> > > > > seeing include:
> > > > >> > > > >
> > > > >> > > > >    - Within Alibaba, multimodal data processing is already
> > in
> > > > >> > production,
> > > > >> > > > >    covering image, video, audio, and text modalities.
> > > > >> > > > >    - In offline conversations with several companies
> > > (including
> > > > >> > ByteDance
> > > > >> > > > >    and Tencent), we have heard substantial demand for
> Flink
> > to
> > > > >> > support
> > > > >> > > > AI data
> > > > >> > > > >    processing / multimodal data processing.
> > > > >> > > > >    - On the ecosystem side, we are working with NVIDIA on
> a
> > > > joint
> > > > >> > demo
> > > > >> > > > >    focused on multimodal data processing, planned for
> Flink
> > > > Forward
> > > > >> > Asia.
> > > > >> > > > >    - The emergence and growth of systems such as Daft, Ray
> > > Data,
> > > > >> > > > >    Data-Juicer, and LAS also reflect rapidly growing
> demand
> > > for
> > > > >> > > > multimodal
> > > > >> > > > >    data processing.
> > > > >> > > > >    - There have also been independent discussions in this
> > > > direction
> > > > >> > > > within
> > > > >> > > > >    the community — for example, the "Streaming-native AI
> > > > Inference
> > > > >> > > > Runtime
> > > > >> > > > >    Layer" proposal on the dev list.
> > > > >> > > > >
> > > > >> > > > > On "why now, instead of waiting for standardization" — I
> > > > understand
> > > > >> > the
> > > > >> > > > > concern. LLM-related frameworks, APIs, and
> application-level
> > > > >> > patterns are
> > > > >> > > > > indeed changing quickly. If this FLIP were trying to bake
> a
> > > > >> specific
> > > > >> > LLM
> > > > >> > > > > API, agent framework, or prompt protocol into Flink, the
> > risk
> > > > would
> > > > >> > be
> > > > >> > > > high.
> > > > >> > > > >
> > > > >> > > > > But most of the capabilities in this proposal are not
> > > > LLM-specific.
> > > > >> > They
> > > > >> > > > > are more fundamental data processing and runtime
> > capabilities:
> > > > >> > Pipeline
> > > > >> > > > > Region-level checkpointing, Object Reference, GPU resource
> > > > >> > declaration,
> > > > >> > > > > columnar data transfer, service-style operator invocation,
> > > > >> > long-running
> > > > >> > > > > async execution. These are useful for today's LLM
> workloads,
> > > and
> > > > >> > equally
> > > > >> > > > > useful for future AI workloads in shapes we cannot fully
> > > predict
> > > > >> > yet. The
> > > > >> > > > > fast-changing parts should live in the ecosystem and SDK
> > > layer;
> > > > the
> > > > >> > FLIP
> > > > >> > > > > should focus on more stable engine-level capabilities.
> > > > >> > > > >
> > > > >> > > > > On tactical changes vs. umbrella, I partly agree with you.
> > > Each
> > > > >> > sub-FLIP
> > > > >> > > > > should be discussed, reviewed, and accepted or rejected on
> > its
> > > > own
> > > > >> > > > merits.
> > > > >> > > > > The umbrella should not bypass the normal FLIP process,
> and
> > > > >> > accepting the
> > > > >> > > > > umbrella does not mean accepting all sub-FLIPs. That
> said, I
> > > > still
> > > > >> > think
> > > > >> > > > > the umbrella is valuable. Its purpose is not to bind the
> 11
> > > > changes
> > > > >> > into
> > > > >> > > > a
> > > > >> > > > > single inseparable package, but to help the community
> align
> > on
> > > > >> > > > principles,
> > > > >> > > > > clarify boundaries and dependencies, and avoid conflicting
> > or
> > > > >> > duplicated
> > > > >> > > > > abstractions across related capabilities.
> > > > >> > > > >
> > > > >> > > > > For example, if RpcOperator is not considered together
> with
> > > > >> > > > non-disruptive
> > > > >> > > > > scaling, it is hard to give GPU operator elasticity
> coherent
> > > > >> > semantics.
> > > > >> > > > > Deploying inference services independently is only the
> first
> > > > step;
> > > > >> > the
> > > > >> > > > > harder question is how Flink uniformly handles service
> > > > discovery,
> > > > >> > > > in-flight
> > > > >> > > > > request draining, backpressure, and failover during
> scaling.
> > > > >> Without
> > > > >> > an
> > > > >> > > > > umbrella, these capabilities can certainly be advanced as
> > > > tactical
> > > > >> > > > changes,
> > > > >> > > > > but we may end up with a set of abstractions that are
> > locally
> > > > >> usable
> > > > >> > but
> > > > >> > > > > globally inconsistent.
> > > > >> > > > >
> > > > >> > > > > On RpcOperator, I agree that we need to be very careful in
> > > > defining
> > > > >> > the
> > > > >> > > > > boundary between the Flink runtime and external
> > orchestration
> > > > >> > systems.
> > > > >> > > > > Kubernetes or the Kubernetes Operator may well be the
> right
> > > > choice
> > > > >> > at the
> > > > >> > > > > physical deployment level. But I still believe Flink
> needs a
> > > > >> > first-class
> > > > >> > > > > RpcOperator abstraction, because deployment is only part
> of
> > > the
> > > > >> > problem —
> > > > >> > > > > the harder part is its semantic integration with the Flink
> > > job.
> > > > >> > > > >
> > > > >> > > > > If model inference is part of the logical data flow, Flink
> > > > needs at
> > > > >> > > > minimum
> > > > >> > > > > to be aware of its service discovery, backpressure
> behavior,
> > > > >> failover
> > > > >> > > > > behavior, in-flight request draining, and scaling
> > > coordination.
> > > > If
> > > > >> > it is
> > > > >> > > > > hidden entirely behind an external black-box service, it
> is
> > > hard
> > > > >> for
> > > > >> > > > Flink
> > > > >> > > > > to provide consistent job-level semantics and operational
> > > > >> experience.
> > > > >> > > > >
> > > > >> > > > > So the point of RpcOperator is not necessarily that "every
> > > > physical
> > > > >> > > > process
> > > > >> > > > > must be directly launched and managed by Flink core," but
> > that
> > > > >> Flink
> > > > >> > > > needs
> > > > >> > > > > to define a service-style operator contract that allows
> such
> > > > >> > operators to
> > > > >> > > > > be invoked correctly by the data flow, coordinated
> correctly
> > > by
> > > > the
> > > > >> > > > > runtime, and understood and operated by users as part of a
> > > Flink
> > > > >> job.
> > > > >> > > > >
> > > > >> > > > > On vectorized batch processing, I agree the long-term
> > > direction
> > > > >> > should
> > > > >> > > > not
> > > > >> > > > > stop at Python. Native columnar / vectorized execution is
> an
> > > > >> > end-to-end
> > > > >> > > > > problem that touches connectors, formats, the type system,
> > > > runtime,
> > > > >> > Java,
> > > > >> > > > > SQL, and Python. The current proposal starts from the
> > > > Java/Python
> > > > >> > > > boundary
> > > > >> > > > > because that is where the row/column conversion overhead
> is
> > > most
> > > > >> > visible.
> > > > >> > > > > End-to-end columnar execution on the Java and SQL side
> > > deserves
> > > > to
> > > > >> be
> > > > >> > > > > discussed further as a separate, larger FLIP.
> > > > >> > > > >
> > > > >> > > > > On multimodal types and SerDes complexity, I agree this
> > needs
> > > > to be
> > > > >> > > > handled
> > > > >> > > > > carefully. Making AI-related objects first-class does not
> > > imply
> > > > >> that
> > > > >> > > > every
> > > > >> > > > > connector must immediately and fully support image, video,
> > > > audio,
> > > > >> > tensor,
> > > > >> > > > > and so on. The concrete incremental path, fallback
> strategy,
> > > and
> > > > >> the
> > > > >> > > > > boundary between formats, connector API, and the type
> system
> > > > will
> > > > >> be
> > > > >> > > > > discussed further in the corresponding sub-FLIPs.
> > > > >> > > > >
> > > > >> > > > > Coming back to the core of the proposal: it is not about
> > > turning
> > > > >> > Flink
> > > > >> > > > into
> > > > >> > > > > an AI framework. It is about making the core objects and
> > > > execution
> > > > >> > > > patterns
> > > > >> > > > > of AI-oriented data processing first-class citizens in
> > Flink.
> > > > >> > > > >
> > > > >> > > > > Best,
> > > > >> > > > > Guowei
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Thu, Apr 30, 2026 at 5:37 AM Yaroslav Tkachenko <
> > > > >> > [email protected]
> > > > >> > > > >
> > > > >> > > > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi Guowei,
> > > > >> > > > > >
> > > > >> > > > > > Thank you for writing this proposal.
> > > > >> > > > > >
> > > > >> > > > > > I may be in the minority here, but I hope my voice will
> be
> > > > >> heard. I
> > > > >> > > > > > disagree with turning Flink into an "AI-Native" engine.
> > > > >> > > > > >
> > > > >> > > > > > Regarding your "Data processing is entering the AI era,
> > and
> > > > Flink
> > > > >> > > > needs to
> > > > >> > > > > > evolve from a traditional BI compute engine into a data
> > > engine
> > > > >> that
> > > > >> > > > > > natively supports AI workloads" claim:
> > > > >> > > > > >
> > > > >> > > > > > - How exactly do you define "AI"? I don't believe there
> > is a
> > > > >> > standard
> > > > >> > > > > > definition. For example, Machine Learning have been
> around
> > > for
> > > > >> more
> > > > >> > > > than a
> > > > >> > > > > > decade, but there were no proposals (or need, in my
> > opinion)
> > > > to
> > > > >> > turn
> > > > >> > > > Flink
> > > > >> > > > > > into an "ML-Native" engine. Flink, in its current state,
> > has
> > > > >> > > > > > been successfully used in many systems alongside
> dedicated
> > > ML
> > > > >> > > > technologies,
> > > > >> > > > > > like feature stores. Based on the context of your
> > proposal,
> > > it
> > > > >> > looks
> > > > >> > > > like
> > > > >> > > > > > you mostly mean LLMs, so could you be specific about the
> > > > >> language?
> > > > >> > > > > > - I wouldn't call Flink "a traditional BI compute
> engine".
> > > > Flink
> > > > >> > is a
> > > > >> > > > > > general data processing technology which can be used
> for a
> > > > >> variety
> > > > >> > of
> > > > >> > > > use
> > > > >> > > > > > cases without any BI involvement.
> > > > >> > > > > > - Do you have any proof that "Users' core workloads are
> > > > rapidly
> > > > >> > > > evolving"
> > > > >> > > > > > and that they require your proposed changes? Case
> studies,
> > > > user
> > > > >> > > > surveys, or
> > > > >> > > > > > submitted issues about the lack of support? Big changes
> > like
> > > > that
> > > > >> > > > require
> > > > >> > > > > > extensive validation.
> > > > >> > > > > > - And even if there is a real need to adopt some
> > LLM-driven
> > > > >> > changes,
> > > > >> > > > why
> > > > >> > > > > > now? The LLM-related tooling has been changing so
> rapidly,
> > > and
> > > > >> it's
> > > > >> > > > hard to
> > > > >> > > > > > predict what will be needed tomorrow. Why does it make
> > sense
> > > > to
> > > > >> > > > introduce
> > > > >> > > > > > changes now, and not wait for more standardization and
> > > > >> > consolidation?
> > > > >> > > > > >
> > > > >> > > > > > To summarize, I think there are a lot of great ideas in
> > the
> > > > >> > proposal,
> > > > >> > > > but
> > > > >> > > > > > in my mind, they need to be addressed as tactical,
> focused
> > > > >> > changes, not
> > > > >> > > > > > under the "AI-Native" umbrella.
> > > > >> > > > > >
> > > > >> > > > > > I also wanted to address a few more specific points:
> > > > >> > > > > >
> > > > >> > > > > > - RpcOperator, why does it need to be managed by Flink?
> I
> > > see
> > > > >> > > > absolutely no
> > > > >> > > > > > need to introduce the additional complexity of
> > orchestrating
> > > > >> > standalone
> > > > >> > > > > > components into the core Flink engine. I can imagine a
> > > > separate
> > > > >> > > > sub-project
> > > > >> > > > > > for an RpcOperator, which could potentially be managed
> by
> > > the
> > > > >> > > > Kubernetes
> > > > >> > > > > > Operator.
> > > > >> > > > > > - You make the case for the vectorized batch processing,
> > but
> > > > only
> > > > >> > on
> > > > >> > > > the
> > > > >> > > > > > Python side. Why stop there? Native columnar vectorized
> > > > execution
> > > > >> > will
> > > > >> > > > > > require end-to-end changes, including connectors, data
> > > format
> > > > >> > support,
> > > > >> > > > Type
> > > > >> > > > > > system support, runtime changes, etc. It seems logical
> to
> > me
> > > > to
> > > > >> > support
> > > > >> > > > > > this execution mode for Java and SQL as well.
> > > > >> > > > > > - Supporting many more data types natively (images,
> video,
> > > > audio,
> > > > >> > > > tensors)
> > > > >> > > > > > will make connector serializers and deserializers
> (SerDes)
> > > > much
> > > > >> > more
> > > > >> > > > > > challenging to implement. Even today, many SerDes in
> > > > officially
> > > > >> > > > supported
> > > > >> > > > > > connectors don't fully implement types like arrays and
> > > > structs.
> > > > >> > > > > >
> > > > >> > > > > > Thank you.
> > > > >> > > > > >
> > > > >> > > > > > On Wed, Apr 29, 2026 at 1:18 AM Guowei Ma <
> > > > [email protected]>
> > > > >> > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Hi Z
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks for the kind words and the thoughtful
> questions.
> > > Let
> > > > me
> > > > >> > take
> > > > >> > > > them
> > > > >> > > > > > > one by one.
> > > > >> > > > > > >
> > > > >> > > > > > >    1. Throughput and latency targets
> > > > >> > > > > > >
> > > > >> > > > > > > To be honest, I don't have concrete numbers to share
> > yet.
> > > > What
> > > > >> I
> > > > >> > can
> > > > >> > > > say
> > > > >> > > > > > is
> > > > >> > > > > > > that our internal testing has already surfaced several
> > > > >> directions
> > > > >> > > > where
> > > > >> > > > > > > Flink can be improved, and at the same time we want to
> > > fully
> > > > >> > leverage
> > > > >> > > > > > > Flink's existing streaming shuffle capabilities. As
> the
> > > > >> > multimodal
> > > > >> > > > > > operator
> > > > >> > > > > > > library matures, we'll progressively publish benchmark
> > > > results.
> > > > >> > > > > > >
> > > > >> > > > > > >    2. Built-in operators
> > > > >> > > > > > >
> > > > >> > > > > > > You're absolutely right. From what I've seen, our
> > internal
> > > > >> users
> > > > >> > > > already
> > > > >> > > > > > > rely on a fairly large set of multimodal operators —
> > > > >> potentially
> > > > >> > > > 100+.
> > > > >> > > > > > The
> > > > >> > > > > > > exact set the community should provide is best
> discussed
> > > in
> > > > >> > FLIP-XXX:
> > > > >> > > > > > > Built-in Multimodal Operators and AI Functions, and
> > > > >> contributions
> > > > >> > > > from
> > > > >> > > > > > the
> > > > >> > > > > > > community are very welcome there.
> > > > >> > > > > > >
> > > > >> > > > > > >    3. Plan for the 11 sub-FLIPs
> > > > >> > > > > > >
> > > > >> > > > > > > The sequencing follows the layering in the umbrella:
> > > > >> > > > > > >
> > > > >> > > > > > >    - Layer 1 (Core Primitives) should be discussed and
> > > > aligned
> > > > >> > first,
> > > > >> > > > > > since
> > > > >> > > > > > >    the second and third layers build on it.
> > > > >> > > > > > >    - Layer 2 (API + compilation + single-node
> execution)
> > > > starts
> > > > >> > with
> > > > >> > > > > > >    getting the API discussion right — the Python API,
> > how
> > > > UDFs
> > > > >> > > > declare
> > > > >> > > > > > >    resources, etc. — after which the single-node
> > execution
> > > > work
> > > > >> > can
> > > > >> > > > build
> > > > >> > > > > > > on
> > > > >> > > > > > >    top.
> > > > >> > > > > > >    - Layer 3 (distributed scheduling and
> checkpointing)
> > > can
> > > > >> > largely
> > > > >> > > > > > proceed
> > > > >> > > > > > >    independently in parallel.
> > > > >> > > > > > >
> > > > >> > > > > > > So while each sub-FLIP is indeed a substantial piece
> of
> > > > work,
> > > > >> > most of
> > > > >> > > > > > them
> > > > >> > > > > > > can be advanced in parallel by different contributors
> > once
> > > > the
> > > > >> > Layer
> > > > >> > > > 1
> > > > >> > > > > > > primitives are settled.
> > > > >> > > > > > >
> > > > >> > > > > > >    4. GPU scheduling roadmap
> > > > >> > > > > > >
> > > > >> > > > > > > Could you expand a bit on which aspect of GPU
> scheduling
> > > you
> > > > >> > have in
> > > > >> > > > mind
> > > > >> > > > > > > as the complex one? "GPU scheduling" covers a fairly
> > wide
> > > > >> surface
> > > > >> > > > area
> > > > >> > > > > > > (resource declaration, operator-level deployment,
> > elastic
> > > > >> > scaling,
> > > > >> > > > > > > heterogeneous GPU types, fine-grained partitioning,
> > etc.),
> > > > and
> > > > >> > the
> > > > >> > > > answer
> > > > >> > > > > > > differs quite a bit depending on which dimension we're
> > > > >> > discussing.
> > > > >> > > > Once I
> > > > >> > > > > > > understand your specific concern I can give a more
> > useful
> > > > >> > response.
> > > > >> > > > > > >
> > > > >> > > > > > > Thanks again for the support — looking forward to the
> > > > continued
> > > > >> > > > > > discussion.
> > > > >> > > > > > >
> > > > >> > > > > > > Best,
> > > > >> > > > > > > Guowei
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > On Tue, Apr 28, 2026 at 4:34 PM zl z <
> > > [email protected]
> > > > >
> > > > >> > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Hey Guowei,
> > > > >> > > > > > > >
> > > > >> > > > > > > > Thanks for the proposal, and I think this is very
> > > > valuable. I
> > > > >> > have
> > > > >> > > > some
> > > > >> > > > > > > > question about it:
> > > > >> > > > > > > >
> > > > >> > > > > > > > 1. What are our expected throughput and latency
> > targets?
> > > > Do
> > > > >> we
> > > > >> > > > have any
> > > > >> > > > > > > > forward-looking tests for this?
> > > > >> > > > > > > >
> > > > >> > > > > > > > 2. AI involves a very large number of operators.
> > Besides
> > > > >> > allowing
> > > > >> > > > users
> > > > >> > > > > > > to
> > > > >> > > > > > > > use them through UDFs, will we also provide commonly
> > > used
> > > > >> > built-in
> > > > >> > > > > > > > operators?
> > > > >> > > > > > > >
> > > > >> > > > > > > > 3. Each of the 11 sub-FLIPs is a major project
> > > involving a
> > > > >> > > > significant
> > > > >> > > > > > > > amount of changes. What is our plan for this?
> > > > >> > > > > > > >
> > > > >> > > > > > > > 4. GPU scheduling is extremely complex. What is our
> > > > current
> > > > >> > > > roadmap for
> > > > >> > > > > > > > this?
> > > > >> > > > > > > >
> > > > >> > > > > > > > This is a very high-quality and exciting proposal.
> > > Making
> > > > >> > Flink an
> > > > >> > > > > > > > AI-native data processing engine will make it far
> more
> > > > >> > valuable in
> > > > >> > > > the
> > > > >> > > > > > AI
> > > > >> > > > > > > > era. Look forward to seeing it land and come to
> > fruition
> > > > >> soon.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Robert Metzger <[email protected]> 于2026年4月28日周二
> > > > 14:38写道：
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hey Guowei,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Thanks for the proposal. I just took a brief look,
> > > here
> > > > are
> > > > >> > some
> > > > >> > > > high
> > > > >> > > > > > > > level
> > > > >> > > > > > > > > questions:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Regarding the RPC Operator: What is the difference
> > to
> > > > the
> > > > >> > async
> > > > >> > > > io
> > > > >> > > > > > > > operator
> > > > >> > > > > > > > > we have already?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > "Connector API for Multimodal Data Source/Sink":
> Why
> > > do
> > > > we
> > > > >> > need
> > > > >> > > > to
> > > > >> > > > > > > touch
> > > > >> > > > > > > > > the connector API for supporting multimodal data?
> > > Isn't
> > > > >> this
> > > > >> > > > more of
> > > > >> > > > > > a
> > > > >> > > > > > > > > formats concern?
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > "Non-Disruptive Scaling for CPU Operators": How do
> > you
> > > > want
> > > > >> > to
> > > > >> > > > > > > guarantee
> > > > >> > > > > > > > > exactly-once on that kind of scaling? E.g. you
> need
> > to
> > > > >> > somehow
> > > > >> > > > make a
> > > > >> > > > > > > > > handover between the old and new new pipeline
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Overall, I find the proposal has some things which
> > > seem
> > > > >> > related
> > > > >> > > > to
> > > > >> > > > > > > making
> > > > >> > > > > > > > > Flink more AI native, but other changes seem
> > > orthogonal
> > > > to
> > > > >> > that.
> > > > >> > > > For
> > > > >> > > > > > > > > example the checkpoint or scaling changes are
> > actually
> > > > >> > unrelated
> > > > >> > > > to
> > > > >> > > > > > AI,
> > > > >> > > > > > > > and
> > > > >> > > > > > > > > just engine improvements.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Tue, Apr 28, 2026 at 5:48 AM Guowei Ma <
> > > > >> > [email protected]>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Hi everyone,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > I'd like to start a discussion on an umbrella
> > > FLIP[1]
> > > > >> that
> > > > >> > lays
> > > > >> > > > > > out a
> > > > >> > > > > > > > > > direction for evolving Flink into a data engine
> > that
> > > > >> > natively
> > > > >> > > > > > > supports
> > > > >> > > > > > > > AI
> > > > >> > > > > > > > > > workloads.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > The short version: user workloads are shifting
> > from
> > > BI
> > > > >> > > > analytics to
> > > > >> > > > > > > > > > multimodal data processing centered on model
> > > > inference,
> > > > >> and
> > > > >> > > > this
> > > > >> > > > > > > > triggers
> > > > >> > > > > > > > > > cascading changes across the stack — multimodal
> > data
> > > > >> > flowing
> > > > >> > > > > > through
> > > > >> > > > > > > > > > pipelines, heterogeneous CPU/GPU resources,
> > > vectorized
> > > > >> > > > execution,
> > > > >> > > > > > and
> > > > >> > > > > > > > > > inference tasks that run for seconds to minutes
> on
> > > > Spot
> > > > >> > > > instances.
> > > > >> > > > > > > The
> > > > >> > > > > > > > > > proposal sketches an evolution along five
> > directions
> > > > >> > > > (development
> > > > >> > > > > > > > > paradigm,
> > > > >> > > > > > > > > > data model, heterogeneous resources, execution
> > > engine,
> > > > >> > fault
> > > > >> > > > > > > > tolerance),
> > > > >> > > > > > > > > > decomposed into 11 sub-FLIPs organized into
> three
> > > > layers:
> > > > >> > core
> > > > >> > > > > > > runtime
> > > > >> > > > > > > > > > primitives, AI workload expression and
> execution,
> > > and
> > > > >> > > > > > > production-grade
> > > > >> > > > > > > > > > operational guarantees. Most sub-FLIPs have no
> > hard
> > > > >> > > > dependencies on
> > > > >> > > > > > > > each
> > > > >> > > > > > > > > > other and can be advanced in parallel.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > A note on scope, since it's an umbrella:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > - In scope here: whether the evolution
> directions
> > > are
> > > > >> > > > reasonable,
> > > > >> > > > > > > > whether
> > > > >> > > > > > > > > > each sub-FLIP's motivation and proposed approach
> > are
> > > > >> > > > well-founded,
> > > > >> > > > > > > and
> > > > >> > > > > > > > > > whether the boundaries and dependencies between
> > > > sub-FLIPs
> > > > >> > are
> > > > >> > > > > > clear.
> > > > >> > > > > > > > > > - Out of scope here: detailed designs, API
> > > specifics,
> > > > and
> > > > >> > > > > > > > implementation
> > > > >> > > > > > > > > > plans of individual sub-FLIPs — those will go
> > > through
> > > > >> > their own
> > > > >> > > > > > > FLIPs.
> > > > >> > > > > > > > > > - Consensus criteria: agreement on the overall
> > > > direction
> > > > >> is
> > > > >> > > > > > > sufficient
> > > > >> > > > > > > > > for
> > > > >> > > > > > > > > > the umbrella to pass; passing it does not lock
> in
> > > any
> > > > >> > > > sub-FLIP's
> > > > >> > > > > > > > design —
> > > > >> > > > > > > > > > sub-FLIPs may still be adjusted, deferred, or
> > > > withdrawn
> > > > >> as
> > > > >> > they
> > > > >> > > > > > > > progress.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > All proposed changes are incremental — no
> existing
> > > > API or
> > > > >> > > > behavior
> > > > >> > > > > > is
> > > > >> > > > > > > > > > removed or altered. Compatibility details are
> > > covered
> > > > at
> > > > >> > the
> > > > >> > > > end of
> > > > >> > > > > > > the
> > > > >> > > > > > > > > > document.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Looking forward to your feedback on the overall
> > > > direction
> > > > >> > and
> > > > >> > > > the
> > > > >> > > > > > > > > layering.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > [1]
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=421957275
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Thanks,
> > > > >> > > > > > > > > > Guowei
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
>

Re: Re: [DISCUSS] FLIP-577: AI-Native Flink — An Umbrella Proposal for Multimodal Data Processing

Reply via email to