Re: [DISCUSS] FLIP-531: Initiate Flink Agents as a new Sub-Project

Xintong Song Tue, 27 May 2025 20:19:09 -0700

Thanks everyone for the positive feedback. It seems to me most people are
in support of this proposal.



I'll wait a couple more days, and then start the vote if no further
concerns are raised.


Best,

Xintong



On Mon, May 26, 2025 at 3:09 PM Jing Ge <[email protected]> wrote:

> Fair enough! Thanks Xintong for the clarification! Looking forward to it!
>
> Best regards,
> Jing
>
> On Mon, May 26, 2025 at 3:48 AM Xintong Song <[email protected]>
> wrote:
>
> > @Robert,
> >
> > 1. I wasn't aware of the ASF subproject concept. Yes, the intention here
> > is to create a repository, just like flink-cdc, flink-kubernetes-operator.
> > I'll add a clarification in the FLIP.
> >
> > 2. Sorry for the confusion. I think we are just using Kafka as an example
> > here. I'll correct it in the FLIP.
> >
> > I think there are two different ways to build a multi-agent system, and
> > we plan to support both.
> >
> >    - Running multiple agents in the same Flink job. This means
> >    managing the lifecycle of the agents as a whole, and end-to-end
> >    checkpointing consistency across them. For this case, we do plan to
> >    leverage StateFun for the communication. As a first step, we probably
> >    will simply depend on StateFun, to quickly get it working. In the long
> >    term, I think it makes sense to move the code that we want to reuse
> >    from StateFun into the new project, rather than depending on a
> >    no-longer-maintained project.
> >    - Running agents as separate Flink jobs. In this case, I agree we
> >    should leverage Flink's connector framework.
> >
> > Thanks for pointing out the issues.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Sun, May 25, 2025 at 10:38 AM Yuan Mei <[email protected]>
> > wrote:
> >
> > > Thanks Xintong, Sean and Chris.
> > >
> > > This is a great step forward for the future of Flink. I'm really
> > > looking forward to it!
> > >
> > > Best,
> > > Yuan
> > >
> > > On Sat, May 24, 2025 at 10:00 PM Robert Metzger <[email protected]>
> > > wrote:
> > >
> > > > Thanks for the nice proposal.
> > > >
> > > > > One question: The proposal talks a lot about establishing a "sub
> > > > > project". If I understand correctly, the ASF has a concept of
> > > > > subprojects, with sub-project committers, mailing lists, Jira
> > > > > projects, etc. [1][2].
> > > > >
> > > > > Is the intention of this proposal to establish such a sub project?
> > > > > Or is the intention to basically create a "flink-agents" git
> > > > > repository, which all existing Flink committers have access to, and
> > > > > where the Flink PMC votes on releases? (I assume this is the
> > > > > intention). If so, I would update the proposal to talk about a new
> > > > > repository, or at least clarify the immediate implications for the
> > > > > project.
> > > >
> > > > > My second question is about this key feature:
> > > > > > *Inter-Agent Communication:* Built-in support for asynchronous
> > > > > > agent-to-agent communication using Kafka.
> > > > >
> > > > > Does this mean the code from the flink-agents repo will have a
> > > > > dependency on AK? One of the big benefits of Flink is that it is
> > > > > independent of the underlying message streaming system. Wouldn't it
> > > > > be more elegant and actually easier to rely on the Flink connector
> > > > > framework here, and leave the concrete implementation to the user?
> > > > > Also, I wonder why we need to rely on an external message streaming
> > > > > system at all? Is it because we want to be able to send messages in
> > > > > arbitrary directions? If so, maybe we can re-use code from Flink
> > > > > Statefun? I personally would think that relying on Flink's internal
> > > > > data transfer model by default brings a lot of cost, performance,
> > > > > operations and implementation benefits ... and users can still
> > > > > manually set up a connector using a Kafka, Pulsar or PubSub
> > > > > connection. WDYT?
> > > >
> > > > Best,
> > > > Robert
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Sub+Projects
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/HADOOP/Apache+Hadoop+Ozone+-+sub-project+to+Apache+TLP+proposal
> > > >
> > > >
> > > > On Fri, May 23, 2025 at 6:14 AM Xintong Song <[email protected]>
> > > > wrote:
> > > >
> > > > > @Jing,
> > > > >
> > > > > > I think the FLIP already included the high-level design goals, by
> > > > > > listing the key features that we plan to support in the Proposed
> > > > > > Solution section, and demonstrating what using the framework may
> > > > > > look like with the code examples. Of course the high-level goals
> > > > > > need to be further detailed, which will be the next step. The
> > > > > > purpose of this FLIP is to get community consensus on initiating
> > > > > > this new project. On the other hand, technical design takes time
> > > > > > to discuss, and likely requires continuous iteration as the
> > > > > > project is being developed. So I think it makes sense to separate
> > > > > > the design discussions from the initiation proposal.
> > > > >
> > > > > > Of course any contributor's thoughts and inputs are valuable to the
> > > > > > project. And efficiency also matters: as the agentic AI industry
> > > > > > grows fast, we really need to keep up with the pace. I believe it
> > > > > > would be more efficient to come up with some initial draft design /
> > > > > > implementation that everyone can comment on, compared to just
> > > > > > randomly collecting ideas when we have nothing. Fortunately, the
> > > > > > project is at an early stage with no historical burdens, which
> > > > > > means we don't need to carefully make sure everything is perfect in
> > > > > > advance, and can always correct / change / rework things if needed.
> > > > > > We can at least do that before we commit to the product
> > > > > > compatibility with the first formal release. This is why we
> > > > > > suggested applying a light, execution-first process, as mentioned
> > > > > > in the Operating Model section. I would not be concerned too much
> > > > > > about not collecting enough inputs at the beginning, because we can
> > > > > > always adjust things afterwards based on new suggestions and
> > > > > > opinions.
> > > > >
> > > > > Best,
> > > > >
> > > > > Xintong
> > > > >
> > > > >
> > > > >
> > > > > > On Fri, May 23, 2025 at 12:13 AM Jing Ge <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > > It is great to see that everyone in this thread agreed with the
> > > > > > > high-level proposal. Just so excited and could not stop asking
> > > > > > > questions :-) Thanks Xintong for the update!
> > > > > > >
> > > > > > > I'd like to share a few more thoughts on my questions and your
> > > > > > > additional input, which lead to a small suggestion.
> > > > > >
> > > > > > > 1. It is great to support freestyle tools beyond the MCP protocol,
> > > > > > > from a user's perspective. However, if we consider agent framework
> > > > > > > design, there might be some choices to make. For example, either
> > > > > > > we stick to MCP internally and turn such external freestyle tools
> > > > > > > into MCP internally, or we design a new abstraction to handle the
> > > > > > > diverse function calls offered by different LLMs, kind of
> > > > > > > repeating what MCP did. Another thought is that the sample API in
> > > > > > > the FLIP suggests that, as a user, after an MCP server
> > > > > > > registration, I could use the prompt() method that directly
> > > > > > > follows it to modify/extend the standard out-of-the-box context
> > > > > > > provided by the MCP server. But that is too detailed and should
> > > > > > > not be discussed in this high-level thread. Happy to join any
> > > > > > > (offline) discussion and contribute.
> > > > > >
> > > > > > > 3. Similar to microservices, there are a few use cases that are
> > > > > > > sensitive to response latency, e.g. stock trading. But it is
> > > > > > > totally fine to focus on asynchronous communication.
> > > > > > >
> > > > > > > 4. Because each of them has its own focus and takes effort to
> > > > > > > build, it was a question of priorities. Good to know Flink Agents
> > > > > > > wants to cover both.
> > > > > > >
> > > > > > > 5. Great to know. I had a similar thought and was a little bit
> > > > > > > confused, because state is more or less a low-level concept for
> > > > > > > operators. Looking forward to understanding how to use it as agent
> > > > > > > memory.
> > > > > > >
> > > > > > > What I actually tried to suggest with all these questions is: does
> > > > > > > it make sense to define some high-level design
> > > > > > > goals/criteria/guidelines? For example:
> > > > > >
> > > > > > > 1. support MCP natively
> > > > > > > 2. single Agent development (for the first stage)
> > > > > > > 3. only support event-driven asynchronous communication
> > > > > > > 4. agent framework for both embedding and workflow development
> > > > > > > (same priority)
> > > > > > > 5. Flink state as memory
> > > > > > > 6. support ReAct, don't support ReWOO (just an example to show my
> > > > > > > thinking. In reality, ReWOO might be useful for some enterprise
> > > > > > > agents considering the deterministic process. An example topic to
> > > > > > > be discussed.)
> > > > > >
> > > > > > > Any contributors in the community can also share their thoughts
> > > > > > > about any high-level design guidelines to be collected at an early
> > > > > > > stage.
> > > > > > >
> > > > > > > The final chosen high-level guidelines could help get everyone on
> > > > > > > the same page to understand and design the upcoming architecture,
> > > > > > > and might also influence the future API design. WDYT?
> > > > > >
> > > > > > Best regards,
> > > > > > Jing
> > > > > >
> > > > > >
> > > > > > > On Thu, May 22, 2025 at 4:55 AM Xintong Song <[email protected]>
> > > > > > > wrote:
> > > > > >
> > > > > > > Thanks everyone for the positive feedback.
> > > > > > >
> > > > > > > > As I said, this FLIP is intended for discussing high-level plans
> > > > > > > > for the new project. The project itself is still at an early
> > > > > > > > stage, and some of the technical designs and solutions are not
> > > > > > > > completely ready yet. So atm I can only share some personal
> > > > > > > > thoughts on the raised questions, and we are open to suggestions
> > > > > > > > and opinions.
> > > > > > >
> > > > > > > @Jing
> > > > > > >
> > > > > > > > 1. Regarding MCP, I think it's just one way (and likely a major
> > > > > > > > way) of providing LLMs with context, but not the only way. E.g., a
> > > > > > > > user may write a dedicated Python function and provide it to the
> > > > > > > > LLM as a tool, which doesn't necessarily need to go through the
> > > > > > > > MCP protocol. At the same time, the LLM may discover more
> > > > > > > > available tools from an MCP server. These are just two different
> > > > > > > > sources that the tools come from, and they can co-exist.
> > > > > > >
> > > > > > > > 2. In the long term, yes, I think. As a first step, we probably
> > > > > > > > will be more focused on how to build individual agents, less on
> > > > > > > > interactions across multiple agents. Not saying we won't support
> > > > > > > > MAS in the first step, but maybe not as complex as the A2A
> > > > > > > > protocol.
> > > > > > >
> > > > > > > > 3. Interactions between agents will be event-driven, so they are
> > > > > > > > naturally asynchronous. I'm not entirely sure about use cases that
> > > > > > > > prefer synchronous agent calls. Could you share some examples?
> > > > > > >
> > > > > > > > 4. I think I didn't fully get the taxonomy here. I mean, why
> > > > > > > > embedding vs. workflow? From my understanding, I think Flink
> > > > > > > > Agents should cover both use cases.
> > > > > > > >
> > > > > > > > 5. Yes, memory is considered. Actually, Flink's state management
> > > > > > > > makes a good foundation for supporting agent memory.
> > > > > > >
> > > > > > > @Nishita
> > > > > > >
> > > > > > > > 1. I think calling an external LLM is similar to an async operator
> > > > > > > > in Flink, in terms of potential latency and backpressure issues.
> > > > > > > > Flink's async operator already supports concurrent async calls,
> > > > > > > > rate control, timeout handling, etc. But eventually, the
> > > > > > > > bottleneck is on the external service side, and we expect model
> > > > > > > > techniques to keep improving, with higher throughput, lower
> > > > > > > > latency, and better stability.
> > > > > > >
> > > > > > > > 2. Good question. I think real-time event-driven processing is
> > > > > > > > somewhat in conflict with asynchronous human-in-the-loop feedback.
> > > > > > > > One idea, which I've seen people use, is to build another agent
> > > > > > > > for validating results and generating feedback. Another idea is to
> > > > > > > > collect samples of results for asynchronous human-in-the-loop
> > > > > > > > validation. But these are just rough ideas. I don't have
> > > > > > > > sophisticated answers at the moment.
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Xintong
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > On Thu, May 22, 2025 at 3:26 AM Yash Anand <[email protected]>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > > Thank you for the proposal. This initiative will make it much
> > > > > > > > > easier to build event-driven AI agents seamlessly.
> > > > > > > >
> > > > > > > > +1 for the proposed Flink Agents sub-project!
> > > > > > > >
> > > > > > > > > On Wed, May 21, 2025 at 9:43 AM Mayank Juneja <[email protected]>
> > > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > > +1 on the FLIP. This is a solid step toward building an agentic
> > > > > > > > > > offering that really leans into Flink’s strengths, and builds on
> > > > > > > > > > the momentum from recent API improvements like FLIP-437 and the
> > > > > > > > > > proposed FLIP-529.
> > > > > > > > > >
> > > > > > > > > > Also wanted to echo the point around agent memory. More advanced
> > > > > > > > > > agentic systems really benefit from both short-term and long-term
> > > > > > > > > > memory. While long-term memory can live in databases (including
> > > > > > > > > > vector stores), having a built-in abstraction for managing
> > > > > > > > > > short-term memory would be super useful. Doesn’t need to be in
> > > > > > > > > > the MVP, but definitely worth considering for the roadmap.
> > > > > > > > > Best,
> > > > > > > > > Mayank
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Wed, May 21, 2025 at 4:54 PM Lincoln Lee <[email protected]>
> > > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 for the proposed flink agents sub-project!
> > > > > > > > > >
> > > > > > > > > > > This aligns perfectly with Flink's core strengths in real-time
> > > > > > > > > > > event processing and stateful computations.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for driving this initiative and looking forward to the
> > > > > > > > > > > detailed technical designs.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Lincoln Lee
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Wed, May 21, 2025 at 11:28 PM Hao Li <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Xintong, Sean and Chris,
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for driving the initiative. Very exciting to bring AI
> > > > > > > > > > > > Agents to Flink to empower streaming use cases.
> > > > > > > > > > >
> > > > > > > > > > > +1 to the FLIP.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Hao
> > > > > > > > > > >
> > > > > > > > > > > On Wed, May 21, 2025 at 7:35 AM Nishita Pattanayak <
> > > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Hi Sean, Chris and Xintong. This seems to be a very exciting
> > > > > > > > > > > > > sub-project.
> > > > > > > > > > > > > +1 for "flink-agents" sub-project.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I was going through the FLIP, and had some questions regarding
> > > > > > > > > > > > > the same:
> > > > > > > > > > > > > 1. How would the external model calls (e.g., OpenAI or internal
> > > > > > > > > > > > > LLMs) be integrated into Flink tasks without introducing
> > > > > > > > > > > > > backpressure or latency issues?
> > > > > > > > > > > > > In my experience, calling an external LLM carries the following
> > > > > > > > > > > > > risks: it is latency-sensitive (LLM inference can take hundreds
> > > > > > > > > > > > > of milliseconds to seconds), flaky (network issues, rate
> > > > > > > > > > > > > limits), and non-deterministic (with timeouts, retries, etc.).
> > > > > > > > > > > > > It would be great to work/brainstorm on how we solve these
> > > > > > > > > > > > > issues.
> > > > > > > > > > > > > 2. In traditional agent workflows, user feedback often plays a
> > > > > > > > > > > > > key role in validating and improving agent outputs. In a
> > > > > > > > > > > > > continuous, long-running Flink-based agent system, where
> > > > > > > > > > > > > interactions might not be user-facing or synchronous, how do we
> > > > > > > > > > > > > incorporate human-in-the-loop feedback or correctness signals
> > > > > > > > > > > > > to validate and iteratively improve agent behavior?
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is a really exciting direction for the Flink ecosystem.
> > > > > > > > > > > > > The idea of building long-running, context-aware agents
> > > > > > > > > > > > > natively on Flink feels like a natural evolution of stream
> > > > > > > > > > > > > processing. I'd love to see this mature and would be excited to
> > > > > > > > > > > > > contribute in any way I can to help productionize and validate
> > > > > > > > > > > > > this in real-world use cases.
> > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, May 21, 2025 at 8:52 AM Xintong Song <[email protected]>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi devs,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Sean, Chris and I would like to start a discussion on FLIP-531
> > > > > > > > > > > > > > [1], about introducing a new sub-project, Flink Agents.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > With the rise of agentic AI, we have identified great new
> > > > > > > > > > > > > > opportunities for Flink, particularly in the system-triggered
> > > > > > > > > > > > > > agent scenarios. We believe the future of AI agent
> > > > > > > > > > > > > > applications is industrialized, where agents will not only be
> > > > > > > > > > > > > > triggered by users, but increasingly by systems as well.
> > > > > > > > > > > > > > Flink's capabilities in real-time distributed event
> > > > > > > > > > > > > > processing, state management and exactly-once consistency
> > > > > > > > > > > > > > fault tolerance make it well-suited as a framework for
> > > > > > > > > > > > > > building such system-triggered agents. Furthermore,
> > > > > > > > > > > > > > system-triggered agents are often tightly coupled with data
> > > > > > > > > > > > > > processing. Flink's outstanding data processing capabilities
> > > > > > > > > > > > > > allow seamless integration between data and agentic
> > > > > > > > > > > > > > processing. These capabilities differentiate Flink from other
> > > > > > > > > > > > > > agent frameworks with unique advantages in the context of
> > > > > > > > > > > > > > system-triggered agents.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > We propose this effort as a sub-project of Apache Flink, with
> > > > > > > > > > > > > > a separate code repository and lightweight development
> > > > > > > > > > > > > > process, for rapid iteration during the early stage.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Please note that this FLIP is focused on the high-level
> > > > > > > > > > > > > > plans, including motivation, positioning, goals, roadmap, and
> > > > > > > > > > > > > > operating model of the project. Detailed technical design is
> > > > > > > > > > > > > > out of scope and will be discussed during the rapid
> > > > > > > > > > > > > > prototyping and iterations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > For more details, please check the FLIP [1]. Looking forward
> > > > > > > > > > > > > > to your feedback.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-531%3A+Initiate+Flink+Agents+as+a+new+Sub-Peoject
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > *Mayank Juneja*
> > > > > > > > > Product Manager | Data Streaming and AI
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
