weiqingy opened a new issue, #836: URL: https://github.com/apache/flink-agents/issues/836
### Description

PR #828 made the built-in chat-model tool-call context checkpoint-safe by normalizing non-primitive Python values (`UUID`, `OutputSchema`, `List[ChatMessage]`) to a primitive-only form before they reach sensory memory, and reconstructing the rich types on read. Without this, Pemja wraps those objects as `PyObject` holders whose JNI pointers go stale after a TaskManager/Python restart, so restoring the checkpointed context SIGSEGVs in `JcpPyObject_FromJObject`.

The unit tests in #828 assert that the stored form is recursively primitive as a checkpoint-safety *proxy*, because the bug cannot be reproduced in local/MiniCluster mode: that path never crosses Pemja, and triggering an in-place recovery by throwing an exception does not recreate the JVM, so the failing code path is never exercised.

After #708 lands, we should add an end-to-end test that triggers a real recovery in a standalone cluster (e.g. by killing the TaskManager process so that the JVM and the embedded Python interpreter are recreated) and verifies that the built-in tool-context flow recovers correctly. This would give true before/after verification of the #828 fix rather than relying on the primitive-form proxy.

Depends on #708.
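For readers unfamiliar with the primitive-form proxy, here is a minimal sketch of the idea in plain Python. It is not the code from #828: `Message` is a hypothetical stand-in for `ChatMessage`, and `to_primitive`, `is_recursively_primitive`, and `restore_context` are illustrative names assumed for this example. It only shows what "normalize on write, check that the stored form is recursively primitive, reconstruct on read" could look like.

```python
import uuid
from dataclasses import dataclass
from typing import Any

# Leaf types assumed to be safe to store without a Python object wrapper.
PRIMITIVES = (str, int, float, bool, type(None))


@dataclass
class Message:
    """Hypothetical stand-in for ChatMessage; not the real class."""
    role: str
    content: str


def to_primitive(value: Any) -> Any:
    """Recursively convert a value into primitives, lists, and str-keyed dicts."""
    if isinstance(value, PRIMITIVES):
        return value
    if isinstance(value, uuid.UUID):
        return str(value)
    if isinstance(value, Message):
        return {"role": value.role, "content": value.content}
    if isinstance(value, (list, tuple)):
        return [to_primitive(v) for v in value]
    if isinstance(value, dict):
        return {str(k): to_primitive(v) for k, v in value.items()}
    raise TypeError(f"cannot normalize {type(value).__name__}")


def is_recursively_primitive(value: Any) -> bool:
    """The proxy check: only primitives, lists, and str-keyed dicts are allowed."""
    if isinstance(value, PRIMITIVES):
        return True
    if isinstance(value, list):
        return all(is_recursively_primitive(v) for v in value)
    if isinstance(value, dict):
        return all(
            isinstance(k, str) and is_recursively_primitive(v)
            for k, v in value.items()
        )
    return False


def restore_context(stored: dict) -> dict:
    """Rebuild the rich types from the stored primitive form."""
    return {
        "request_id": uuid.UUID(stored["request_id"]),
        "messages": [Message(**m) for m in stored["messages"]],
    }
```

A check of this kind can only show that nothing object-shaped is stored. It cannot show that a restart actually recovers, which is the gap the proposed end-to-end test is meant to close.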
