GitHub user letaoj edited a discussion: WIP: Replay-based Per Action State 
Consistency

# Background

Flink Agents execute various actions during record processing, including model 
inference and tool invocation. Model inference involves calling LLMs for 
reasoning, classification, or generation tasks, often through expensive API 
calls to external providers. Tool invocation allows agents to interact with 
external systems through UDFs with network access, with native support for 
Model Context Protocol (MCP). These actions enable agents to perform contextual 
searches, execute business logic, interact with enterprise systems, and invoke 
specialized processing services.

## The Problem: Side Effects and Costs from Action Replay

While Flink provides exactly-once processing guarantees for stream processing 
on a per message basis, agent actions create challenges around side effects, 
costs, and recovery semantics. Both model inference and tool invocation can 
produce effects that persist beyond the agent's execution context or incur 
significant costs that should not be duplicated.

The core problem occurs when:
1. A Flink agent processes a record and executes multiple actions (model calls, 
tool calls)
2. Some actions complete successfully, potentially modifying external state or 
consuming costly resources
3. The agent crashes before completing record processing
4. Upon recovery, Flink reprocesses the same record, repeating all actions

This creates several issues: 
* duplicate side effects in external systems (e.g. sending the same email 
multiple times to the same recipient) 
* unnecessary costs from repeated expensive model inference calls
* consistency violations from mixed successful and repeated actions
* resource waste from redundant operations
* observability challenges when debugging multiple executions of the same 
logical operation.

Flink's streaming architecture introduces additional complexity through 
continuous processing on unbounded streams, distributed state management, 
back-pressure from action failures, and a semantic gap where exactly-once 
guarantees don't extend to external model providers or tool endpoints.

# Goals and Non-Goals

## Goals

* **Durability of Agent State**: Ensure that the state is durably stored when 
the agent crashes
* **State Recovery**: Provide a mechanism to recover the agent's short-term 
memory by replaying the action history from the state store
* **Minimal Performance Overhead**: The solution should have minimal impact on 
the performance of the Flink job during normal operation.
* **Pluggable Architecture**: The design should be flexible enough to support 
different types of external databases.

## Non-Goals

* **Long-Term Memory**: This design focuses on recovering the short-term memory 
of the current task. It does not address the problem of long-term memory or 
knowledge persistence.
* **Database Management**: This design assumes that the external database is 
already set up and managed. It does not cover the details of database 
administration.

# High-Level Design


GitHub link: https://github.com/apache/flink-agents/discussions/108

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to