[I] [Tech Debt] Strengthen live-LLM e2e assertions via structured tool-invocation events [flink-agents]

via GitHub Sun, 31 May 2026 10:45:05 -0700


weiqingy opened a new issue, #719:
URL: https://github.com/apache/flink-agents/issues/719


   ### Search before asking
   
   - [x] I searched in the 
[issues](https://github.com/apache/flink-agents/issues) and found nothing 
similar.
   
   ### Description
   
   Follow-up to #716 (item 3, the optional hardening direction). #716's 
load-bearing flakiness fixes — per-test retry and the Python-connection timeout 
— shipped in #717 and restored trustworthy CI signal. This issue tracks the 
remaining improvement: making the live-LLM e2e tests assert on **what the agent 
did** rather than an exact numeric result.
   
   #### Problem
   
   Several live-LLM e2e tests assert an **exact** final value produced by a 
small, non-deterministic model (`qwen3:1.7b`). The clearest case is 
`react_agent_test.py:135` (latest `main`):
   
   ```python
   input_list.append({"key": "0001", "value": InputData(a=2123, b=2321, c=312)})  # :128
   ...
   assert output_list[0]["0001"].result == 1386528  # :135  (2123 + 2321) * 312
   ```
   
   This single assertion conflates three independent dimensions that the model 
gets right at very different rates. From the #713 review discussion, the 1.7B 
model has been observed to: answer directly without tools, call the right tool 
with wrong arguments, emit a tool call as plain text instead of a real tool 
call, miss a later calculation step, get a correct tool result but still 
produce a wrong final answer, or return a response that doesn't match the 
output schema. So **"the right tool was invoked" and "the final answer is 
correct" are separate concerns**, and an exact-value check tests the most 
model-capability-dependent dimension while saying nothing about the rest.
   
   #### Direction — layered assertions
   
   Decompose the single exact-value check into separable assertions, ordered by 
how deterministic each is:
   
   1. **The correct tool was invoked** (`add` / `multiply`) — order- and 
model-capability-independent; the most stable signal for the tool-calling path.
   2. **The tool was invoked with the expected arguments** (e.g. `add(2123, 
2321)`, `multiply(4444, 312)`) — validates input parsing without depending on 
the model's final reasoning. (Per the #713 thread: invoking the right tool 
isn't meaningful unless the arguments are also right.)
   3. **Final-output correctness**: kept as a *separate* concern. Where layers 
1–2 already pin the tool inputs, the final value follows deterministically from 
the arithmetic the tools performed, so the exact-value check becomes redundant 
and can relax to a schema/shape check; retries (#717) already cover the 
residual model-capability flakiness.
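
The three layers above could be expressed roughly as follows. This is a minimal sketch, assuming tool invocations have already been collected into a list of `{name, arguments}` dicts. The helper names (`assert_tools_invoked`, `assert_tool_arguments`, `assert_output_shape`) and the argument keys `a` / `b` are hypothetical and not taken from the repository.

```python
from typing import Any, Dict, List, Set

Invocation = Dict[str, Any]  # assumed shape: {"name": str, "arguments": dict}


def assert_tools_invoked(invocations: List[Invocation], expected: Set[str]) -> None:
    """Layer 1: every expected tool name appears at least once."""
    seen = {inv["name"] for inv in invocations}
    missing = expected - seen
    assert not missing, f"tools never invoked: {sorted(missing)}"


def assert_tool_arguments(
    invocations: List[Invocation], name: str, expected_args: List[int]
) -> None:
    """Layer 2: some call to `name` used the expected operands.

    Operands are compared order-insensitively, which is only valid for
    commutative tools such as add and multiply.
    """
    want = sorted(expected_args)
    got = [
        sorted(inv["arguments"].values())
        for inv in invocations
        if inv["name"] == name
    ]
    assert want in got, f"{name} never called with {want}; saw {got}"


def assert_output_shape(result: Any) -> None:
    """Layer 3: schema/shape only, no exact-value comparison."""
    assert isinstance(result, int) and not isinstance(result, bool)


# Usage with hand-written invocations (illustrative, not captured from a run).
invocations = [
    {"name": "add", "arguments": {"a": 2123, "b": 2321}},
    {"name": "multiply", "arguments": {"a": 312, "b": 4444}},
]
assert_tools_invoked(invocations, {"add", "multiply"})
assert_tool_arguments(invocations, "add", [2123, 2321])
assert_tool_arguments(invocations, "multiply", [4444, 312])
assert_output_shape(1386528)
```

Keeping each layer in its own helper means a failure message names the dimension that broke (tool missing, wrong operands, or malformed output) instead of reporting a single mismatched number.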
   
   **Explicit non-goal:** we will **not** build a pass-set from observed test 
outputs. The values seen in #716 (`45750`, `4927483`, `432596736`, …) are 
*wrong* answers; accepting "any previously-returned value" would whitelist 
known-broken outputs and the test would pass on a broken agent. A pass-set 
built from outputs only ever grows toward "always pass." Accepting *equivalent 
representations of the known-correct value* (formatting/whitespace) is fine; 
accepting *observed outputs* is not.
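
The difference between the two kinds of acceptance can be illustrated with a small sketch. It assumes the final result may arrive as an int, a float, or a formatted string; `normalize_numeric` and `matches_expected` are hypothetical names. Only formatting variants of the one known-correct value pass; there is no list of previously observed outputs.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

EXPECTED = (2123 + 2321) * 312  # the single known-correct value, 1386528


def normalize_numeric(raw: Any) -> Optional[int]:
    """Map formatting variants of an integer to an int; None if not an integer."""
    text = str(raw).strip().replace(",", "")
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    if not value.is_finite() or value != value.to_integral_value():
        return None
    return int(value)


def matches_expected(raw: Any) -> bool:
    """True only for representations of EXPECTED, never for other observed values."""
    return normalize_numeric(raw) == EXPECTED
```

Under this sketch `"1,386,528"` and `" 1386528 "` are accepted, while the wrong values quoted above, such as `45750`, are still rejected.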
   
   #### Prerequisite — tool events exist, but only on the Flink path
   
   The runtime already emits and persists tool events; this is **not** 
new-runtime work, but the coverage is uneven by runner:
   
   - `ToolRequestEvent` carries `tool_calls: List<{id, name, arguments}>` and 
`ToolResponseEvent` carries `responses` / `success` / `error` 
(`api/.../event/ToolRequestEvent.java`, `api/.../event/ToolResponseEvent.java`; 
Python `python/flink_agents/api/events/tool_event.py`). Both tool **name and 
arguments** are present.
   - On the **Flink/operator runner**, every event flows 
`EventRouter.notifyEventProcessed()` → `FileEventLogger.append()` 
(`runtime/.../operator/EventRouter.java`, 
`runtime/.../eventlog/FileEventLogger.java`), writing `events-*.log` JSON Lines 
under a configurable `baseLogDir`. `python_event_logging_test.py` already reads 
these (set `baseLogDir`, run, `glob("events-*.log")`, parse JSON).
   - The **local runner** (`LocalRunner`, 
`python/flink_agents/runtime/local_runner.py`) has **no event log** — 
`send_event` only appends to an in-memory deque and emits an unstructured 
`logger.info` line (`local_runner.py:116-125`). There is zero event-log wiring 
in the Python runtime (confirmed by grep on `main`).
   
   This splits the work into two tracks:
   
   - **Cross-language e2e tests** (`chat_model_cross_language_test.py` et al., 
all `from_datastream().to_datastream()` → Flink path): tool events are already 
persisted. Needs only a small shared **read helper** (e.g. 
`collect_tool_invocations(log_dir) -> [{name, arguments}]`) over the existing 
`events-*.log` format to replace the brittle text-scan assertions (cf. the weak 
`"3"` substring check flagged on #713).
   - **`react_agent_test::test_react_agent_on_local_runner`** 
(`from_list().apply().to_list()` → `LocalRunner`): the event log isn't 
available. Needs **either** (a) a small in-memory event-collection hook on 
`LocalRunner` so a test can read emitted `ToolRequestEvent`s directly (events 
already pass through its deque — just not exposed), **or** (b) switching the 
test to the Flink-backed runner like `python_event_logging_test`. Option (a) 
keeps the local runner lightweight and is the smaller change.
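
Option (a) might look something like the sketch below. It does not use the real `LocalRunner` or the real `ToolRequestEvent`: `RunnerSketch` and the dataclass are stand-ins that mimic only the behaviour described above (events appended to a deque inside `send_event`), and `add_event_listener` and `ToolCallCollector` are hypothetical names for the proposed hook.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Deque, Dict, List


@dataclass
class ToolRequestEvent:
    """Stand-in for the flink_agents event; only `tool_calls` is modelled."""

    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


class RunnerSketch:
    """Stand-in runner: queues events and notifies optional listeners."""

    def __init__(self) -> None:
        self._events: Deque[Any] = deque()
        self._listeners: List[Callable[[Any], None]] = []

    def add_event_listener(self, listener: Callable[[Any], None]) -> None:
        """Hypothetical hook: register a callback invoked for every sent event."""
        self._listeners.append(listener)

    def send_event(self, event: Any) -> None:
        self._events.append(event)  # existing behaviour: enqueue for processing
        for listener in self._listeners:
            listener(event)  # new behaviour: expose the event to tests


class ToolCallCollector:
    """Test-side listener that keeps only tool name and arguments."""

    def __init__(self) -> None:
        self.invocations: List[Dict[str, Any]] = []

    def __call__(self, event: Any) -> None:
        if isinstance(event, ToolRequestEvent):
            for call in event.tool_calls:
                self.invocations.append(
                    {"name": call["name"], "arguments": call["arguments"]}
                )


# Usage: the collector sees tool requests and ignores everything else.
runner = RunnerSketch()
collector = ToolCallCollector()
runner.add_event_listener(collector)
runner.send_event(
    ToolRequestEvent(
        tool_calls=[{"id": "1", "name": "add", "arguments": {"a": 2123, "b": 2321}}]
    )
)
runner.send_event("some other event")
```

A listener list keeps the runner free of any file I/O, and a runner with no listeners registered behaves exactly as before.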
   
   #### Scope
   
   1. Add a shared test helper to read tool-invocation events from the 
Flink-path event log.
   2. Convert exact-equality assertions in the cross-language suites to layered 
tool-name + argument assertions via that helper; relax final-output checks to 
schema/shape where layers 1–2 make them redundant.
   3. Expose tool events to the local runner (small `LocalRunner` collector 
hook) and convert `react_agent_test`'s local-runner assertion.
   4. (Investigation) Mine the last ~N CI runs to quantify per-dimension pass 
rates (tool-invoked vs. correct-args vs. exact-answer); use that data to choose 
where the hard assertion sits — as input to the design, not as a runtime 
pass-set.
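
For the investigation in item 4, the per-dimension rates could be computed with something as small as the following sketch. It assumes each CI run has already been reduced to three booleans. The field names are hypothetical and the rows are placeholders, not real CI results.

```python
from typing import Dict, Iterable, List

DIMENSIONS = ("tool_invoked", "args_correct", "answer_exact")


def pass_rates(runs: Iterable[Dict[str, bool]]) -> Dict[str, float]:
    """Return the fraction of runs that passed each dimension."""
    rows: List[Dict[str, bool]] = list(runs)
    if not rows:
        raise ValueError("no runs to summarize")
    return {
        dim: sum(bool(row[dim]) for row in rows) / len(rows) for dim in DIMENSIONS
    }


# Placeholder rows for illustration only; real input would come from CI logs.
placeholder_runs = [
    {"tool_invoked": True, "args_correct": True, "answer_exact": True},
    {"tool_invoked": True, "args_correct": True, "answer_exact": False},
    {"tool_invoked": True, "args_correct": False, "answer_exact": False},
    {"tool_invoked": False, "args_correct": False, "answer_exact": False},
]
rates = pass_rates(placeholder_runs)
```

The resulting numbers would inform which layer carries the hard assertion; they are not consulted at test time.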
   
   #### Out of scope
   
   Retry/timeout infrastructure (shipped in #717).
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
