FYI, I've started a vote for this FLIP: https://lists.apache.org/list.html?dev@flink.apache.org
Best,
Xintong

On Wed, May 28, 2025 at 11:18 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks everyone for the positive feedback. It seems to me most people are in support of this proposal.

I'll wait for a couple more days, and then start the vote if no further concerns are raised.

Best,
Xintong

On Mon, May 26, 2025 at 3:09 PM Jing Ge <j...@ververica.com.invalid> wrote:

Fair enough! Thanks Xintong for the clarification! Looking forward to it!

Best regards,
Jing

On Mon, May 26, 2025 at 3:48 AM Xintong Song <tonysong...@gmail.com> wrote:

@Robert,

1. I wasn't aware of the ASF subproject concept. Yes, the intention here is to create a repository, just like flink-cdc and flink-kubernetes-operator. I'll add a clarification to the FLIP.

2. Sorry for the confusion. I think we are just using Kafka as an example here. I'll correct it in the FLIP.

I think there are two different ways to build a multi-agent system, and we plan to support both.

- Running multiple agents in the same Flink job. This means managing the lifecycle of the agents as a whole, and ensuring end-to-end checkpointing consistency across them. For this case, we do plan to leverage StateFun for the communication. As a first step, we will probably simply depend on StateFun to quickly get it working. In the long term, I think it makes sense to move the code we want to reuse from StateFun into the new project, rather than depending on a no-longer-maintained project.
- Running agents as separate Flink jobs. In this case, I agree we should leverage Flink's connector framework (a rough sketch of this option follows below).
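For illustration, here is a minimal sketch of the second option, with the Kafka connector standing in for whatever transport the user picks. Topic names, broker addresses and the plain-String payloads are made up, and none of this is flink-agents API; it only shows that the transport stays pluggable through the existing connector framework.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class AgentBJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Agent B consumes the events that agent A (a separate Flink job) published.
        KafkaSource<String> fromAgentA = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")              // placeholder address
                .setTopics("agent-a-output")                     // placeholder topic
                .setGroupId("agent-b")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> incoming = env.fromSource(
                fromAgentA, WatermarkStrategy.noWatermarks(), "from-agent-a");

        // ... agent B's own processing would go here ...

        // Agent B publishes its results for whichever agent consumes them next.
        KafkaSink<String> toNextAgent = KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("agent-b-output")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        incoming.sinkTo(toNextAgent);
        env.execute("agent-b");
    }
}

Swapping Kafka for Pulsar or any other supported system would only change the source/sink builders, which is exactly the decoupling Robert describes.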
Thanks for pointing out the issues.

Best,
Xintong

On Sun, May 25, 2025 at 10:38 AM Yuan Mei <yuanmei.w...@gmail.com> wrote:

Thanks Xintong, Sean and Chris.

This is a great step forward for the future of Flink. I'm really looking forward to it!

Best,
Yuan

On Sat, May 24, 2025 at 10:00 PM Robert Metzger <rmetz...@apache.org> wrote:

Thanks for the nice proposal.

One question: The proposal talks a lot about establishing a "sub project". If I understand correctly, the ASF has a concept of subprojects, with sub-project committers, mailing lists, Jira projects, etc. [1][2].

Is the intention of this proposal to establish such a subproject? Or is the intention to basically create a "flink-agents" git repository, which all existing Flink committers have access to and whose releases the Flink PMC votes on? (I assume this is the intention.) If so, I would update the proposal to talk about a new repository, or at least clarify the immediate implications for the project.

My second question is about this key feature:
> *Inter-Agent Communication:* Built-in support for asynchronous agent-to-agent communication using Kafka.

Does this mean the code from the flink-agents repo will have a dependency on Apache Kafka? One of the big benefits of Flink is that it is independent of the underlying message streaming system. Wouldn't it be more elegant, and actually easier, to rely on the Flink connector framework here and leave the concrete implementation to the user?

Also, I wonder why we need to rely on an external message streaming system at all. Is it because we want to be able to send messages in arbitrary directions? If so, maybe we can reuse code from Flink StateFun? I personally would think that relying on Flink's internal data transfer model by default brings a lot of cost, performance, operations and implementation benefits ... and users can still manually set up a connector using a Kafka, Pulsar or Pub/Sub connection. WDYT?

Best,
Robert

[1] https://cwiki.apache.org/confluence/display/CASSANDRA/Cassandra+Sub+Projects
[2] https://cwiki.apache.org/confluence/display/HADOOP/Apache+Hadoop+Ozone+-+sub-project+to+Apache+TLP+proposal

On Fri, May 23, 2025 at 6:14 AM Xintong Song <tonysong...@gmail.com> wrote:

@Jing,

I think the FLIP already includes the high-level design goals, by listing the key features we plan to support in the Proposed Solution section and by demonstrating what using the framework may look like with the code examples. Of course the high-level goals need to be further detailed, which will be the next step. The purpose of this FLIP is to get community consensus on initiating this new project. On the other hand, technical design takes time to discuss, and will likely require continuous iteration as the project is being developed. So I think it makes sense to separate the design discussions from the initiation proposal.

Of course any contributor's thoughts and inputs are valuable to the project. Efficiency also matters: as the agentic AI industry grows fast, we really need to keep up with its pace. I believe it would be more efficient to come up with an initial draft design / implementation that everyone can comment on, compared to collecting ideas at random when we have nothing concrete yet. Fortunately, the project is at an early stage with no historical burdens, which means we don't need to make sure everything is perfect in advance, and can always correct / change / rework things if needed. We can do that at least until we commit to product compatibility with the first formal release. This is why we suggested applying a light, execution-first process, as mentioned in the Operating Model section. I would not be too concerned about not collecting enough input at the beginning, because we can always adjust things afterwards based on new suggestions and opinions.
Best,
Xintong

On Fri, May 23, 2025 at 12:13 AM Jing Ge <j...@ververica.com.invalid> wrote:

It is great to see that everyone in this thread agrees with the high-level proposal. I am just so excited that I could not stop asking questions :-) Thanks Xintong for the update!

I'd like to share a few more thoughts based on my questions and your additional input, and then make a small suggestion.

1. It is great to support freestyle tools beyond the MCP protocol, from the users' perspective. However, when it comes to the agent framework design, there might be some choices to make. For example, either we stick to MCP internally and convert such external freestyle tools into MCP tools, or we design a new abstraction to handle the diverse function calls offered by different LLMs, which would somewhat repeat what MCP did. Another thought is that the sample API in the FLIP suggests that, as a user, after registering an MCP server, I could use the follow-up prompt() method to modify/extend the standard out-of-the-box context provided by the MCP server. But that is too detailed and should not be discussed in this high-level thread. Happy to join any (offline) discussion and contribute.

3. Similar to microservices, there are a few use cases that are sensitive to response latency, e.g. stock trading. But it is totally fine to focus on asynchronous communication.

4. Because each of them has its own focus and takes effort to build, it was a question of priorities. Good to know Flink Agents wants to cover both.

5. Great to know. I had a similar thought and was a little confused, because state is more or less a low-level concept for operators. Looking forward to understanding how to use it as agent memory (a rough illustration of one possibility is sketched below).

What I actually tried to suggest with all these questions is: does it make sense to define some high-level design goals/criteria/guidelines? For example:

1. support MCP natively
2. single-agent development (for the first stage)
3. only support event-driven asynchronous communication
4. agent framework for both embedding and workflow development (same priority)
5. Flink state as memory
6. support ReAct, don't support ReWOO (just an example to show my thought; in reality, ReWOO might be useful for some enterprise agents considering its deterministic process, so it is an example topic to be discussed)

Any contributors in the community can also share their thoughts about high-level design guidelines to be collected at an early stage.
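To make guideline 5 concrete, here is a purely hypothetical sketch of how per-key Flink state could back a short-term agent memory. The class name, the callModel() helper and the memory layout are all made up; only the keyed-state APIs are actual Flink APIs.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: short-term agent memory kept in Flink keyed state, so it
// is scoped per key (e.g. per conversation or entity), checkpointed, and
// restored after failures.
public class MemoryBackedAgentFn extends KeyedProcessFunction<String, String, String> {

    private transient ListState<String> memory;   // recent interactions for this key

    @Override
    public void open(Configuration parameters) {
        memory = getRuntimeContext().getListState(
                new ListStateDescriptor<>("agent-memory", String.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        memory.add(event);                                // remember the new observation
        String reply = callModel(memory.get(), event);    // callModel() is a placeholder, not a real API
        memory.add(reply);                                // remember the agent's own answer
        out.collect(reply);
    }

    private String callModel(Iterable<String> context, String event) {
        return "reply-to-" + event;                       // stub
    }
}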
The final chosen high-level guidelines could help get everyone on the same page to understand and design the upcoming architecture, and might also influence the future API design. WDYT?

Best regards,
Jing

On Thu, May 22, 2025 at 4:55 AM Xintong Song <tonysong...@gmail.com> wrote:

Thanks everyone for the positive feedback.

As I said, this FLIP is intended for discussing high-level plans for the new project. The project itself is still at an early stage, and some of the technical designs and solutions are not completely ready yet. So at the moment I can only share some personal thoughts on the raised questions, and we are open to suggestions and opinions.

@Jing

1. Regarding MCP, I think it's just one way (and likely a major way) of providing LLMs with context, but not the only way. E.g., a user may write a dedicated Python function and provide it to the LLM as a tool, which doesn't necessarily need to go through the MCP protocol. At the same time, the LLM may discover more available tools from an MCP server. These are just two different sources the tools can come from, and they can co-exist.

2. In the long term, yes, I think so. As a first step, we will probably focus more on how to build individual agents and less on interactions across multiple agents. Not saying we won't support MAS in the first step, but maybe not as complex as the A2A protocol.

3. Interactions between agents will be event-driven, so they are naturally asynchronous. I'm not entirely sure about use cases that prefer synchronous agent calls. Could you share some examples?

4. I think I didn't fully get the taxonomy here. I mean, why embedding vs. workflow? From my understanding, Flink Agents should cover both use cases.

5. Yes, memory is considered. Actually, Flink's state management makes a good foundation for supporting agent memory.

@Nishita

1. I think calling an external LLM is similar to an async operator in Flink, in terms of potential latency and backpressure issues. Flink's async operator already supports concurrent async calls, rate control, timeout handling, etc. But eventually, the bottleneck is on the external service side, and we expect model techniques to keep improving, with larger throughput, lower latency, and better stability.
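To make the analogy concrete, here is a minimal sketch of wrapping an external LLM call in Flink's async I/O operator. The stubbed completion call, the timeout fallback and the capacity/timeout values are assumptions for illustration; only the async I/O APIs themselves are real Flink APIs.

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Sketch: calling an external LLM through Flink's async I/O operator, so slow or
// flaky responses are bounded by a timeout and an in-flight capacity limit
// instead of blocking the task thread.
public class LlmAsyncFunction extends RichAsyncFunction<String, String> {

    @Override
    public void asyncInvoke(String prompt, ResultFuture<String> resultFuture) {
        // The supplier stands in for any non-blocking HTTP/SDK call that
        // returns a CompletableFuture<String>; it is not a real client API.
        CompletableFuture
                .supplyAsync(() -> "llm-answer-for: " + prompt)      // stubbed model call
                .whenComplete((answer, err) -> {
                    if (err != null) {
                        resultFuture.completeExceptionally(err);     // surface the failure to Flink
                    } else {
                        resultFuture.complete(Collections.singleton(answer));
                    }
                });
    }

    @Override
    public void timeout(String prompt, ResultFuture<String> resultFuture) {
        // Degrade gracefully instead of failing the job when the model is too slow.
        resultFuture.complete(Collections.singleton("TIMEOUT"));
    }
}

// Wiring: at most 10 in-flight requests per subtask, 30s timeout, input order preserved.
// DataStream<String> replies = AsyncDataStream.orderedWait(
//         prompts, new LlmAsyncFunction(), 30, TimeUnit.SECONDS, 10);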
2. Good question. I think real-time event-driven processing is somewhat in conflict with asynchronous human-in-the-loop feedback. One idea, which I've seen people use, is to build another agent for validating results and generating feedback. Another idea is to collect samples of the results for asynchronous human-in-the-loop validation (sketched below). But these are just rough ideas; I don't have sophisticated answers at the moment.
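A rough sketch of the second idea: divert a small random sample of agent results to a side output that could feed an asynchronous human-review queue. The 1% sampling rate and the tag name are arbitrary choices for illustration; side outputs themselves are a standard Flink feature.

import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// Sketch: forward agent results downstream as usual, but copy a small random
// sample to a side output for offline human validation.
public class SampleForReviewFn extends ProcessFunction<String, String> {

    public static final OutputTag<String> REVIEW = new OutputTag<String>("human-review") {};

    @Override
    public void processElement(String result, Context ctx, Collector<String> out) {
        out.collect(result);                                       // normal path, unaffected
        if (ThreadLocalRandom.current().nextDouble() < 0.01) {     // ~1% sample, arbitrary
            ctx.output(REVIEW, result);                            // routed to human reviewers
        }
    }
}

// Wiring (results is the stream of agent outputs):
// SingleOutputStreamOperator<String> main = results.process(new SampleForReviewFn());
// main.getSideOutput(SampleForReviewFn.REVIEW).sinkTo(reviewQueueSink);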
Best,
Xintong

On Thu, May 22, 2025 at 3:26 AM Yash Anand <yan...@confluent.io.invalid> wrote:

Thank you for the proposal. This initiative will make it much easier to build event-driven AI agents seamlessly.

+1 for the proposed Flink Agents sub-project!

On Wed, May 21, 2025 at 9:43 AM Mayank Juneja <mayankjunej...@gmail.com> wrote:

+1 on the FLIP. This is a solid step toward building an agentic offering that really leans into Flink's strengths, and it builds on the momentum from recent API improvements like FLIP-437 and the proposed FLIP-529.

I also wanted to echo the point around agent memory. More advanced agentic systems really benefit from both short-term and long-term memory. While long-term memory can live in databases (including vector stores), having a built-in abstraction for managing short-term memory would be super useful. It doesn't need to be in the MVP, but it is definitely worth considering for the roadmap.

Best,
Mayank

--
*Mayank Juneja*
Product Manager | Data Streaming and AI

On Wed, May 21, 2025 at 4:54 PM Lincoln Lee <lincoln.8...@gmail.com> wrote:

+1 for the proposed Flink Agents sub-project!

This aligns perfectly with Flink's core strengths in real-time event processing and stateful computation.

Thanks for driving this initiative, and looking forward to the detailed technical designs.

Best,
Lincoln Lee

On Wed, May 21, 2025 at 11:28 PM Hao Li <lihao3...@gmail.com> wrote:

Hi Xintong, Sean and Chris,

Thanks for driving the initiative. It is very exciting to bring AI agents to Flink to empower streaming use cases.

+1 to the FLIP.

Thanks,
Hao

On Wed, May 21, 2025 at 7:35 AM Nishita Pattanayak <nishita.pattana...@gmail.com> wrote:

Hi Sean, Chris and Xintong. This seems to be a very exciting sub-project. +1 for the "flink-agents" sub-project.

I was going through the FLIP and had some questions:

1. How would external model calls (e.g., OpenAI or internal LLMs) be integrated into Flink tasks without introducing backpressure or latency issues? In my experience, calling an external LLM carries the following risks: it is latency-sensitive (LLM inference can take hundreds of milliseconds to seconds), flaky (network issues, rate limits), and non-deterministic (with timeouts, retries, etc.). It would be great to work/brainstorm on how we solve these issues.

2. In traditional agent workflows, user feedback often plays a key role in validating and improving agent outputs. In a continuous, long-running Flink-based agent system, where interactions might not be user-facing or synchronous, how do we incorporate human-in-the-loop feedback or correctness signals to validate and iteratively improve agent behavior?

This is a really exciting direction for the Flink ecosystem. The idea of building long-running, context-aware agents natively on Flink feels like a natural evolution of stream processing. I'd love to see this mature and would be excited to contribute in any way I can to help productionize and validate it in real-world use cases.

On Wed, May 21, 2025 at 8:52 AM Xintong Song <tonysong...@gmail.com> wrote:

Hi devs,

Sean, Chris and I would like to start a discussion on FLIP-531 [1], about introducing a new sub-project, Flink Agents.
With the rise of agentic AI, we have identified great new opportunities for Flink, particularly in system-triggered agent scenarios. We believe the future of AI agent applications is industrialized, where agents will not only be triggered by users, but increasingly by systems as well. Flink's capabilities in real-time distributed event processing, state management and exactly-once fault tolerance make it well suited as a framework for building such system-triggered agents. Furthermore, system-triggered agents are often tightly coupled with data processing, and Flink's outstanding data processing capabilities allow seamless integration between data processing and agentic processing. These capabilities differentiate Flink from other agent frameworks and give it unique advantages in the context of system-triggered agents.

We propose this effort as a sub-project of Apache Flink, with a separate code repository and a lightweight development process, to allow rapid iteration during the early stage.

Please note that this FLIP is focused on the high-level plans, including the motivation, positioning, goals, roadmap, and operating model of the project. Detailed technical design is out of scope and will be discussed during the rapid prototyping and iterations.

For more details, please check the FLIP [1]. Looking forward to your feedback.
Best,

Xintong

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-531%3A+Initiate+Flink+Agents+as+a+new+Sub-Peoject