GitHub user letaoj created a discussion: WIP: Replay-based Per Action State Consistency
# Background Flink Agents execute various actions during record processing, including model inference and tool invocation. Model inference involves calling LLMs for reasoning, classification, or generation tasks, often through expensive API calls to external providers. Tool invocation allows agents to interact with external systems through UDFs with network access, with native support for Model Context Protocol (MCP). These actions enable agents to perform contextual searches, execute business logic, interact with enterprise systems, and invoke specialized processing services. ## The Problem: Side Effects and Costs from Action Replay While Flink provides exactly-once processing guarantees for stream processing on a per message basis, agent actions create challenges around side effects, costs, and recovery semantics. Both model inference and tool invocation can produce effects that persist beyond the agent's execution context or incur significant costs that should not be duplicated. The core problem occurs when: 1. A Flink agent processes a record and executes multiple actions (model calls, tool calls) 2. Some actions complete successfully, potentially modifying external state or consuming costly resources 3. The agent crashes before completing record processing 4. Upon recovery, Flink reprocesses the same record, repeating all actions This creates several issues: * duplicate side effects in external systems * unnecessary costs from repeated expensive model inference calls * consistency violations from mixed successful and repeated actions * resource waste from redundant operations * observability challenges when debugging multiple executions of the same logical operation. Flink's streaming architecture introduces additional complexity through continuous processing on unbounded streams, distributed state management, back-pressure from action failures, and a semantic gap where exactly-once guarantees don't extend to external model providers or tool endpoints. # Goals and Non-Goals ## Goals * **Durability of Agent State**: Ensure that the state is durably stored when the agent crashes * **State Recovery**: Provide a mechanism to recover the agent's short-term memory by replaying the action history from the state store * **Minimal Performance Overhead**: The solution should have minimal impact on the performance of the Flink job during normal operation. * **Pluggable Architecture**: The design should be flexible enough to support different types of external databases. ## Non-Goals * **Long-Term Memory**: This design focuses on recovering the short-term memory of the current task. It does not address the problem of long-term memory or knowledge persistence. * **Database Management**: This design assumes that the external database is already set up and managed. It does not cover the details of database administration. # High-Level Design GitHub link: https://github.com/apache/flink-agents/discussions/108 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
