GitHub user xintongsong added a comment to the discussion: Replay-based Per Action State Consistency
@letaoj Thanks for updating the design doc based on our offline discussion. I think the overall design is quite good now. I just have a few more comments on the details.

1. For the request-response map, I think we should use a unique identifier of the action execution as the key, rather than the hash of the event, because one event may trigger multiple actions. `TaskActionState` looks right in this respect, since it tries to capture the execution state of an action. But the execution flow shows event hashes being used as part of the map key.

2. I'd suggest not rebuilding the short-term memory at the beginning, but rebuilding it while replaying the actions. To be specific, when recovering from a checkpoint, the short-term memory (state) should be restored to how it was when the checkpoint was made. Then we replay the inputs and check whether each action has already been performed. If it has, we skip the action, apply any state changes it made, and get its output (events). This ensures that an action being re-executed sees the same state as when it was executed for the first time.

3. `<message_key>-<event_hash_1>: {"request": request, "short-term-memory": short_term_memory.dump_json()"}` Does this mean we are storing the whole short-term memory for each request-response pair? That should be unnecessary. Since the full short-term memory is already persisted with the checkpoint, we only need to persist the incremental changes to the short-term memory since the checkpoint.

4. `TaskActionState.output_event` should be a list, because each action may emit multiple events.

GitHub link: https://github.com/apache/flink-agents/discussions/108#discussioncomment-14209491

----
This is an automatically sent email for issues@flink.apache.org.
To unsubscribe, please send an email to: issues-unsubscr...@flink.apache.org
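
The recovery flow suggested in point 2, combined with the key choice from point 1, the incremental memory changes from point 3, and the list of output events from point 4, might be sketched roughly as below. This is only an illustration under stated assumptions, not the design doc's implementation: `ActionRecord`, `process`, the `message_key/event/action_name` key format, and the in-memory `action_log` dict are all hypothetical names, and memory changes are simplified to key overwrites (no deletions).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ActionRecord:
    """Hypothetical persisted result of one completed action execution."""
    memory_updates: Dict[str, Any] = field(default_factory=dict)  # incremental changes only
    output_events: List[str] = field(default_factory=list)  # an action may emit several events


# An action receives a copy of short-term memory and the event, and returns
# (memory updates, output events).
Action = Callable[[Dict[str, Any], str], Tuple[Dict[str, Any], List[str]]]


def process(message_key: str, event: str, actions: Dict[str, Action],
            memory: Dict[str, Any], action_log: Dict[str, ActionRecord]) -> List[str]:
    """Run every action triggered by `event`, skipping those already logged."""
    outputs: List[str] = []
    for name, action in actions.items():
        # Assumed key format: identifies the action execution, not only the event,
        # because one event may trigger multiple actions.
        exec_id = f"{message_key}/{event}/{name}"
        record = action_log.get(exec_id)
        if record is None:
            updates, events = action(dict(memory), event)
            record = ActionRecord(updates, events)
            action_log[exec_id] = record  # a real system would persist this durably
        # Executed or skipped, apply the recorded changes so later actions see the same state.
        memory.update(record.memory_updates)
        outputs.extend(record.output_events)
    return outputs


calls: List[str] = []  # tracks which actions were really executed


def count(memory, event):
    calls.append(f"count:{event}")
    total = memory.get("count", 0) + 1
    return {"count": total}, [f"counted:{event}", f"total:{total}"]


def audit(memory, event):
    calls.append(f"audit:{event}")
    return {"last": event}, [f"audited:{event}"]


actions = {"count": count, "audit": audit}
checkpoint = {"count": 0}  # short-term memory as of the checkpoint
action_log: Dict[str, ActionRecord] = {}

# First run after the checkpoint.
memory = dict(checkpoint)
first_out = process("k1", "e1", actions, memory, action_log)

# Failure: restore memory from the checkpoint, then replay the same input.
calls_before_replay = len(calls)
recovered = dict(checkpoint)
replay_out = process("k1", "e1", actions, recovered, action_log)
```

In this sketch the replay produces the same events and the same final memory as the first run without invoking either action again, since both executions are found in `action_log` and only their recorded updates are applied.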