weiqingy opened a new pull request, #839: URL: https://github.com/apache/flink-agents/pull/839
Linked issue: #723 ### Purpose of change Python memory values cross the Pemja boundary into Flink state when written via `MemoryObject.set()`. Values that Pemja cannot materialize into native JVM types — Pydantic models, `uuid.UUID`, `str`/`int` `Enum`s, custom classes, `tuple`/`set`, or dicts with non-`str` keys — are stored as stale `PyObject` wrappers and crash the JVM in `JcpPyObject_FromJObject` when the state is restored after a TaskManager/Python process restart. This adds a `set()`-time guard that rejects such values early with a clear, actionable `TypeError`, instead of letting them reach state and fail on restore. It follows #828 (which normalized the framework's own built-in tool-context writes) and covers arbitrary user values. The accepted contract is recursive and exact-typed: `None | bool | int | float | str | list[...] | dict[str, ...]`. Exact-type matching (not `isinstance`) is required so a `str`/`int` `Enum` — which passes `isinstance(x, str)` yet is still Pemja-wrapped — cannot slip through. The error message names the offending nested location and suggests a conversion (e.g. `str(uuid)`, `model.model_dump(mode="json")`, `list(value)`). The same guard runs in both `LocalMemoryObject.set` and `FlinkMemoryObject.set`, so local execution fails the same way production would, giving one unified contract. Conformance to the new contract: the memory unit tests no longer store a `set`/custom class, and the `workflow` and `flink-integration` e2e agents now materialize their Pydantic payloads (`model_dump(mode="json")` on write, `model_validate` on read) — these stored models on the real Flink path and were latent instances of this same bug. Scope notes: this is forward-looking and does not migrate pre-fix checkpoints (same as #828). `bytes` is intentionally not yet accepted — the Python→Java `bytes` conversion through Pemja is unverified, and wrongly accepting an unsafe value would defeat the validator; a follow-up will verify and possibly include it. The Python memory-value-contract documentation is a separate follow-up PR (it touches only the memory docs page, with no code overlap). ### Tests New `runtime/tests/test_memory_value_validation.py` covers the accepted types (including nested `list`/`dict`) and every rejection — Pydantic model, `uuid.UUID`, `tuple`/`set`/`frozenset`, a `str` `Enum`, a custom class, non-`str` dict keys, a nested bad value (verifying the breadcrumb), a `MemoryObject` value (verifying the `new_object()` guidance), and that `FlinkMemoryObject.set` raises a raw `TypeError` rather than wrapping it in `MemoryObjectError`. A true before/after checkpoint-restore test cannot run on the MiniCluster (an in-place recovery there does not recreate the JVM, so the Pemja path is never crossed); the unit tests assert the accept/reject classification as the checkpoint-safety proxy, the same approach used in #828. `uv run --no-sync pytest flink_agents/runtime/tests/test_memory_value_validation.py flink_agents/runtime/tests/test_local_memory_object.py flink_agents/runtime/tests/test_memory_reference.py flink_agents/plan/tests/actions -k "not e2e"` passes (45), the broader `runtime/tests` suite passes (79), and `ruff check` is clean. ### API Adds `validate_memory_value(path, value)` in `flink_agents.api.memory_object`. `MemoryObject.set()` now raises `TypeError` for values that are not recursively checkpoint-stable; previously such values were accepted and failed only on state restore. ### Documentation - [x] `doc-needed` <!-- The Python memory-value-contract docs are a separate follow-up PR; this PR changes the accepted-value behavior. --> - [ ] `doc-not-needed` - [ ] `doc-included` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
