weiqingy opened a new issue, #836:
URL: https://github.com/apache/flink-agents/issues/836

   ### Description
   
   PR #828 made the built-in chat-model tool-call context checkpoint-safe by 
normalizing non-primitive Python values (`UUID`, `OutputSchema`, 
`List[ChatMessage]`) to a primitive-only form before they reach sensory memory, 
and reconstructing the rich types on read. Without this, Pemja wraps those 
objects as `PyObject` holders whose JNI pointers go stale after a 
TaskManager/Python restart, so restoring the checkpointed context SIGSEGVs in 
`JcpPyObject_FromJObject`.
   
   The unit tests in #828 assert that the stored form is recursively primitive 
as a checkpoint-safety *proxy*, because the bug cannot be reproduced in 
local/MiniCluster mode: that path never crosses Pemja, and triggering an 
in-place recovery by throwing an exception does not recreate the JVM, so the 
failing code path is never exercised.
   
   After #708 lands, we should add an end-to-end test that triggers a real 
recovery in a standalone cluster — e.g. by killing the TaskManager process so 
the JVM (and the embedded Python interpreter) is recreated — and verify that 
the built-in tool-context flow recovers correctly. This would give true 
before/after verification of the #828 fix rather than relying on the 
primitive-form proxy.
   
   Depends on #708.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to