Re: [D] Replay-based Per Action State Consistency [flink-agents]

via GitHub Mon, 25 Aug 2025 14:18:48 -0700


GitHub user letaoj added a comment to the discussion: Replay-based Per Action 
State Consistency


Thanks @xintongsong for your valuable comment!

> For the request-response map, I think we should use some unique identifier of 
> the action execution as a key, rather than hash of event. Because one event 
> may trigger multiple actions. It looks right from TaskActionState which tries 
> to capture the execution state of an action. But in the execution flow, it 
> shows hash of events are used as part of the map key.

Updated both the diagram and the example to capture this. It makes sense to use
the unique identifier of the action, rather than the event hash, as the key
suffix.
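
A minimal sketch of why the key suffix matters. Everything here is
hypothetical: the helper names, the `msg-42` message key, and the
`plan#0` / `search#1` action identifiers are made up for illustration and are
not the actual flink-agents key layout.

```python
import hashlib
import json


def event_hash(event: dict) -> str:
    """Deterministic short hash of an event (illustrative only)."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:8]


def action_state_key(message_key: str, action_id: str) -> str:
    """Hypothetical key layout: <message_key>-<action_id>."""
    return f"{message_key}-{action_id}"


# One event that triggers two different actions.
event = {"type": "InputEvent", "payload": "hello"}
triggered_actions = ["plan#0", "search#1"]  # assumed unique per execution

# Keyed by event hash: both actions collapse onto the same map entry.
hash_keys = {f"msg-42-{event_hash(event)}" for _ in triggered_actions}

# Keyed by action identifier: each action execution gets its own entry.
action_keys = {action_state_key("msg-42", a) for a in triggered_actions}
```

With the event hash as the suffix, the second action would overwrite (or be
mistaken for) the first; with an action identifier, the two entries stay
separate.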

> I'd suggest not to rebuild the short-term memory at the beginning, but to 
> rebuild it during replaying the actions. To be specific, when recovering from 
> a checkpoint, the short-term memory (state) should be restored to how it was 
> when the checkpoint was made. Then we replay the inputs, and check for 
> whether the action has already been performed. If performed, we skip the 
> action, apply any state changes it made, and get the output (events). This 
> ensures actions being re-executed see the same state as when they were 
> executed for the first time.

Yes, that's the plan. The recovery step was originally meant to restore the
short-term memory from the snapshot I stored in the state. As your third
comment points out, though, the Flink checkpoint already handles that, so
there is no need to recover it from the state itself.
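
A rough sketch of that replay flow, under stated assumptions: `TaskActionState`
is reduced to two fields, `run_or_replay` and the in-memory `action_log` dict
are hypothetical stand-ins for the real state store, and an action is modelled
as a function returning `(memory_updates, output_events)`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class TaskActionState:
    """Simplified stand-in: what one completed action changed and emitted."""
    memory_updates: dict[str, Any] = field(default_factory=dict)
    output_events: list[Any] = field(default_factory=list)


def run_or_replay(
    action_id: str,
    action: Callable[[dict], tuple[dict, list]],
    memory: dict,
    action_log: dict[str, TaskActionState],
) -> list:
    """Execute the action, or replay its recorded effects if already done."""
    recorded = action_log.get(action_id)
    if recorded is not None:
        # Already performed before the failure: skip execution, re-apply
        # its state changes, and return the recorded output events.
        memory.update(recorded.memory_updates)
        return list(recorded.output_events)
    updates, events = action(dict(memory))
    memory.update(updates)
    action_log[action_id] = TaskActionState(dict(updates), list(events))
    return list(events)


calls = []


def greet(mem: dict) -> tuple[dict, list]:
    calls.append("greet")
    return {"greeted": True}, ["GreetingEvent"]


action_log: dict[str, TaskActionState] = {}
memory = {"user": "alice"}  # short-term memory as of the checkpoint
first = run_or_replay("msg-42-greet#0", greet, memory, action_log)

# Simulated recovery: memory is back at the checkpoint, the log survived.
restored = {"user": "alice"}
second = run_or_replay("msg-42-greet#0", greet, restored, action_log)
```

On the second call `greet` is not invoked again, yet `restored` ends up in the
same state and the same events come back.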

> <message_key>-<event_hash_1>: {"request": request, "short-term-memory": 
> short_term_memory.dump_json()} Does this mean we are storing the whole 
> short-term memory for each request-response pair? That should be unnecessary. 
> Since the full short-term memory is already persisted with the checkpoint, we 
> only need to persist the incremental changes of short-term memory since the 
> checkpoint.

Yes, you are right. We do not need the full snapshot. I keep forgetting that
the short-term memory is considered part of the Flink state that is persisted
in the checkpoint.
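
To illustrate storing only the incremental changes, here is a small sketch.
The `memory_delta` / `apply_delta` helpers and the `{"set": ..., "removed":
...}` layout are assumptions for illustration, with short-term memory modelled
as a flat dict.

```python
def memory_delta(before: dict, after: dict) -> dict:
    """Keys added or changed, plus keys removed, between two memory states."""
    changed = {
        k: v for k, v in after.items() if k not in before or before[k] != v
    }
    removed = [k for k in before if k not in after]
    return {"set": changed, "removed": removed}


def apply_delta(memory: dict, delta: dict) -> None:
    """Re-apply a recorded delta on top of checkpoint-restored memory."""
    memory.update(delta["set"])
    for key in delta["removed"]:
        memory.pop(key, None)


# Memory as persisted by the checkpoint.
checkpointed = {"user": "alice", "history": ["hi"], "scratch": 1}

# Memory after one action ran on top of it.
after_action = {"user": "alice", "history": ["hi", "hello"], "turns": 1}

# Only this delta needs to be stored with the action's record.
delta = memory_delta(checkpointed, after_action)

# On recovery: start from the checkpoint, then re-apply the delta.
recovered = dict(checkpointed)
apply_delta(recovered, delta)
```

The unchanged `user` entry never appears in the delta, so the per-action
record stays small no matter how large the checkpointed memory is.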

> TaskActionState.output_event should be a list, because each action may emit 
> multiple events.

Updated

GitHub link: 
https://github.com/apache/flink-agents/discussions/108#discussioncomment-14214800

----
This is an automatically sent email for issues@flink.apache.org.
To unsubscribe, please send an email to: issues-unsubscr...@flink.apache.org
