weiqingy opened a new issue, #719: URL: https://github.com/apache/flink-agents/issues/719
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/flink-agents/issues) and found nothing similar.

### Description

Follow-up to #716 (item 3, the optional hardening direction). #716's load-bearing flakiness fixes (per-test retry and the Python-connection timeout) shipped in #717 and restored trustworthy CI signal. This issue tracks the remaining improvement: making the live-LLM e2e tests assert on **what the agent did** rather than an exact numeric result.

#### Problem

Several live-LLM e2e tests assert an **exact** final value produced by a small, non-deterministic model (`qwen3:1.7b`). The clearest case is `react_agent_test.py:135` (latest `main`):

```python
input_list.append({"key": "0001", "value": InputData(a=2123, b=2321, c=312)})  # :128
...
assert output_list[0]["0001"].result == 1386528  # :135  (2123 + 2321) * 312
```

This single assertion conflates three independent dimensions that the model gets right at very different rates. From the #713 review discussion, the 1.7B model has been observed to: answer directly without tools, call the right tool with wrong arguments, emit a tool call as plain text instead of a real tool call, miss a later calculation step, get a correct tool result but still produce a wrong final answer, or return a response that doesn't match the output schema.

So **"the right tool was invoked" and "the final answer is correct" are separate concerns**, and an exact-value check tests the most model-capability-dependent dimension while saying nothing about the rest.

#### Direction: layered assertions

Decompose the single exact-value check into separable assertions, ordered by how deterministic each is:

1. **The correct tool was invoked** (`add` / `multiply`): order- and model-capability-independent; the most stable signal for the tool-calling path.
2. **The tool was invoked with the expected arguments** (e.g.
`add(2123, 2321)`, `multiply(4444, 312)`): validates input parsing without depending on the model's final reasoning. (Per the #713 thread: invoking the right tool isn't meaningful unless the arguments are also right.)
3. **Final-output correctness**: kept as a *separate* concern. Where layers 1–2 already pin the tool inputs, the final value is deterministic arithmetic the tools computed, so the exact-value check becomes redundant and can relax to a schema/shape check; retries (#717) already cover the residual model-capability flakiness.

**Explicit non-goal:** we will **not** build a pass-set from observed test outputs. The values seen in #716 (`45750`, `4927483`, `432596736`, …) are *wrong* answers; accepting "any previously-returned value" would whitelist known-broken outputs and the test would pass on a broken agent. A pass-set built from outputs only ever grows toward "always pass." Accepting *equivalent representations of the known-correct value* (formatting/whitespace) is fine; accepting *observed outputs* is not.

#### Prerequisite: tool events exist, but only on the Flink path

The runtime already emits and persists tool events; this is **not** new-runtime work, but the coverage is uneven by runner:

- `ToolRequestEvent` carries `tool_calls: List<{id, name, arguments}>` and `ToolResponseEvent` carries `responses` / `success` / `error` (`api/.../event/ToolRequestEvent.java`, `api/.../event/ToolResponseEvent.java`; Python `python/flink_agents/api/events/tool_event.py`). Both tool **name and arguments** are present.
- On the **Flink/operator runner**, every event flows `EventRouter.notifyEventProcessed()` → `FileEventLogger.append()` (`runtime/.../operator/EventRouter.java`, `runtime/.../eventlog/FileEventLogger.java`), writing `events-*.log` JSON Lines under a configurable `baseLogDir`. `python_event_logging_test.py` already reads these (set `baseLogDir`, run, `glob("events-*.log")`, parse JSON).
- The **local runner** (`LocalRunner`, `python/flink_agents/runtime/local_runner.py`) has **no event log**: `send_event` only appends to an in-memory deque and emits an unstructured `logger.info` line (`local_runner.py:116-125`). There is zero event-log wiring in the Python runtime (confirmed by grep on `main`).

This splits the work into two tracks:

- **Cross-language e2e tests** (`chat_model_cross_language_test.py` et al., all `from_datastream().to_datastream()` → Flink path): tool events are already persisted. Needs only a small shared **read helper** (e.g. `collect_tool_invocations(log_dir) -> [{name, arguments}]`) over the existing `events-*.log` format to replace the brittle text-scan assertions (cf. the weak `"3"` substring check flagged on #713).
- **`react_agent_test::test_react_agent_on_local_runner`** (`from_list().apply().to_list()` → `LocalRunner`): the event log isn't available. Needs **either** (a) a small in-memory event-collection hook on `LocalRunner` so a test can read emitted `ToolRequestEvent`s directly (events already pass through its deque, just not exposed), **or** (b) switching the test to the Flink-backed runner like `python_event_logging_test`. Option (a) keeps the local runner lightweight and is the smaller change.

#### Scope

1. Add a shared test helper to read tool-invocation events from the Flink-path event log.
2. Convert exact-equality assertions in the cross-language suites to layered tool-name + argument assertions via that helper; relax final-output checks to schema/shape where layers 1–2 make them redundant.
3. Expose tool events to the local runner (small `LocalRunner` collector hook) and convert `react_agent_test`'s local-runner assertion.
4. (Investigation) Mine the last ~N CI runs to quantify per-dimension pass rates (tool-invoked vs. correct-args vs. exact-answer); use that data to choose where the hard assertion sits, as input to the design, not as a runtime pass-set.
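
For illustration, scope items 1 and 2 could look roughly like the sketch below. It is not taken from the repository. It assumes each line of an `events-*.log` file is a JSON object that contains, somewhere in its structure, a `tool_calls` list of `{id, name, arguments}` entries with `arguments` as a JSON object; the real record envelope written by `FileEventLogger` may differ, which is why the sketch searches the record recursively instead of hard-coding a path. `assert_tool_called_with` is a hypothetical name, and it compares argument values order-insensitively because the tools' parameter names are not stated here.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List


def _find_tool_calls(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict entry of any `tool_calls` list nested in a JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "tool_calls" and isinstance(value, list):
                for call in value:
                    if isinstance(call, dict):
                        yield call
            else:
                yield from _find_tool_calls(value)
    elif isinstance(node, list):
        for item in node:
            yield from _find_tool_calls(item)


def collect_tool_invocations(log_dir: str) -> List[Dict[str, Any]]:
    """Return [{"name": ..., "arguments": ...}] from all events-*.log files.

    Assumption: one JSON object per line (JSON Lines); the envelope layout
    around `tool_calls` is not assumed.
    """
    invocations: List[Dict[str, Any]] = []
    for log_file in sorted(Path(log_dir).glob("events-*.log")):
        with log_file.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                for call in _find_tool_calls(json.loads(line)):
                    invocations.append(
                        {"name": call.get("name"), "arguments": call.get("arguments")}
                    )
    return invocations


def assert_tool_called_with(
    invocations: List[Dict[str, Any]], name: str, expected_args: List[Any]
) -> None:
    """Layer 1: the tool was invoked. Layer 2: with the expected argument values."""
    calls = [inv for inv in invocations if inv["name"] == name]
    invoked = [inv["name"] for inv in invocations]
    assert calls, f"tool {name!r} was never invoked; saw {invoked}"

    # Hypothetical choice: compare values only, ignoring order and parameter names.
    expected = sorted(expected_args, key=str)
    seen = [
        sorted(inv["arguments"].values(), key=str)
        for inv in calls
        if isinstance(inv["arguments"], dict)
    ]
    assert expected in seen, f"{name!r} called with {seen}, expected {expected}"
```

A converted test would then call `collect_tool_invocations(log_dir)` once and assert `assert_tool_called_with(invocations, "add", [2123, 2321])` and `assert_tool_called_with(invocations, "multiply", [4444, 312])`, leaving the final output to a schema/shape check. Ignoring argument order is only safe for commutative tools such as these two; a non-commutative tool would need a comparison keyed on parameter names.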

#### Out of scope

Retry/timeout infrastructure (shipped in #717).

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
