[PR] [test][runtime] Strengthen live-LLM e2e tests with structured tool-invocation assertions [flink-agents]

via GitHub Sun, 31 May 2026 18:13:11 -0700


weiqingy opened a new pull request, #722:
URL: https://github.com/apache/flink-agents/pull/722


   Linked issue: #719
   
   ### Purpose of change
   
   Follow-up to #716 (item 3). The live-LLM e2e tests assert on the agent's 
final output value (`react_agent_test` asserted `result == 1386528`) or weak 
substrings of it (`"3" in output`, `"22" in output`). A single check conflates 
three things a small non-deterministic model gets right at different rates: 
which tool was invoked, with what arguments, and the final synthesized answer — 
so a failure cannot be localized, and the substring checks barely test anything.
   
   This PR adds layered assertions on the structured `ToolRequestEvent`s the 
runtime already produces, so CI checks *what the agent did* (which tool, what 
arguments) rather than only an exact number the model often gets wrong.
   
   The tool invocations are sourced in two ways, depending on the execution path:
   - **Flink path** (cross-language tests + react remote): a shared helper 
`collect_tool_invocations(log_dir)` reads the `events-*.log` the 
`FileEventLogger` already writes.
   - **Local runner** (react local, `from_list`/`to_list`): the pure-Python 
`LocalRunner` has no event log, so a small in-memory capture hook exposes the 
`ToolRequestEvent`s that already flow through its event deque.
   
   Both yield the same `{name, arguments}` shape, so assertions read 
identically.
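
The Flink-path helper could look roughly like the sketch below. It assumes each line of an `events-*.log` file is one JSON object with an `eventType` field and an `event` payload carrying `tool_calls`; that record layout and those field names are assumptions for illustration, not the confirmed output format of `FileEventLogger`.

```python
import glob
import json
import os


def collect_tool_invocations(log_dir):
    """Return [{"name": ..., "arguments": ...}, ...] from events-*.log files.

    Assumed (hypothetical) record layout, one JSON object per line:
    {"eventType": "...ToolRequestEvent",
     "event": {"tool_calls": [{"function": {"name": ..., "arguments": {...}}}]}}
    """
    invocations = []
    for path in sorted(glob.glob(os.path.join(log_dir, "events-*.log"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                record = json.loads(line)
                # Skip every event that is not a tool request.
                if not record.get("eventType", "").endswith("ToolRequestEvent"):
                    continue
                for call in record.get("event", {}).get("tool_calls", []):
                    function = call.get("function", {})
                    invocations.append(
                        {
                            "name": function.get("name"),
                            "arguments": function.get("arguments", {}),
                        }
                    )
    return invocations
```

Because the result is a plain list of `{name, arguments}` dicts, the local-runner path only needs to map its captured events into the same shape for the assertions to be shared.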
   
   Notes on two deliberate choices:
   - Final-value checks are kept (not relaxed). The agent's `.result` is a 
separate model synthesis step, not a tool output, so it can be wrong even when 
the tool calls are correct — this was observed live (the model called 
`multiply(4444, 312)` correctly but emitted a wrong final number). The value 
check catches a failure the tool assertions cannot, so it remains; its residual 
flakiness is covered by the agent's retries and the per-test retry from #716.
   - For the react local test we assert the `multiply` invocation, not `add`. 
The small model frequently computes the addition itself and only calls the 
multiply tool, so an `add` assertion would be an unreliable signal; 
`multiply`'s first argument is the threaded sum, so asserting it proves the 
addition was computed correctly and the tool was used.
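
As an illustration of the layering described above, the sketch below checks the tool name, then the arguments, then the final value as three separate assertions. The signature of `assert_tool_invoked` and the argument names `a` and `b` are hypothetical; only the `multiply(4444, 312)` call and the expected result `1386528` come from this description.

```python
def assert_tool_invoked(invocations, name, expected_arguments=None):
    """Assert that some invocation matches `name` and, if given, a subset of arguments."""
    expected_arguments = expected_arguments or {}
    matches = [
        inv
        for inv in invocations
        if inv["name"] == name
        and all(inv["arguments"].get(k) == v for k, v in expected_arguments.items())
    ]
    assert matches, (
        f"no {name!r} call with arguments {expected_arguments!r}; "
        f"saw {invocations!r}"
    )
    return matches[0]


# Invocations as they might be captured from a run in which the model
# called multiply correctly (argument names "a"/"b" are hypothetical).
invocations = [{"name": "multiply", "arguments": {"a": 4444, "b": 312}}]
result = 1386528  # stand-in for the agent's synthesized final answer

# Layer 1: which tool was invoked.
assert_tool_invoked(invocations, "multiply")
# Layer 2: with what arguments; the first argument is the threaded sum.
assert_tool_invoked(invocations, "multiply", {"a": 4444, "b": 312})
# Layer 3: the final synthesized answer, checked separately.
assert result == 4444 * 312
```

If layer 3 fails while layers 1 and 2 pass, the failure is localized to the model's synthesis step rather than to tool selection or argument passing.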
   
   ### Tests
   
   - New fixture-based unit tests for the helpers (`collect_tool_invocations`, 
`assert_tool_invoked`, `tool_invocations_from_events`) — no live model required.
   - New unit test for the local-runner capture hook, asserting both that the 
event is captured and that it still dispatches to its action (so capture cannot 
silently break tool execution).
   - The strengthened e2e tests: chat_model / yaml / react (remote + local) 
cross-language tests.
   - The react local-runner test was exercised live (Ollama, qwen3:1.7b) across 
many runs to confirm the captured tool calls and arguments end-to-end. The 
Flink-path e2e tests run in CI.
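
The capture-hook unit test can be pictured with the toy example below. `ToyRunner` and the stand-in `ToolRequestEvent` are hypothetical simplifications written for this illustration, not the actual `LocalRunner` code; the point is only the pair of assertions that capture happened and that dispatch still happened.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToolRequestEvent:  # stand-in for the runtime's event class
    tool_calls: list = field(default_factory=list)


class ToyRunner:
    """Hypothetical miniature runner: events flow through a deque to actions."""

    def __init__(self, actions):
        self._actions = actions  # maps event type -> callable
        self._events = deque()
        self._tool_request_events = []  # in-memory capture

    def send(self, event):
        self._events.append(event)

    def run(self):
        while self._events:
            event = self._events.popleft()
            if isinstance(event, ToolRequestEvent):
                self._tool_request_events.append(event)  # capture only
            action = self._actions.get(type(event))
            if action is not None:
                action(event)  # dispatch still happens after capture

    def get_tool_request_events(self):
        return list(self._tool_request_events)


def test_capture_does_not_break_dispatch():
    dispatched = []
    runner = ToyRunner({ToolRequestEvent: dispatched.append})
    event = ToolRequestEvent(tool_calls=[{"name": "multiply"}])
    runner.send(event)
    runner.run()
    assert runner.get_tool_request_events() == [event]  # captured
    assert dispatched == [event]  # and still dispatched
```

Returning a copy from the accessor keeps a test from mutating the runner's internal capture list by accident.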
   
   ### API
   
   Adds a test-facing accessor `get_tool_request_events()` on `LocalRunner` and 
`LocalExecutionEnvironment` (Python runtime), returning the `ToolRequestEvent`s 
captured during execution. No change to the Java side; the events were already 
emitted there.
   
   ### Documentation
   
   - [ ] `doc-needed`
   - [x] `doc-not-needed`
   - [ ] `doc-included`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
