weiqingy opened a new pull request, #717:
URL: https://github.com/apache/flink-agents/pull/717

   Linked issue: #716
   
   ### Purpose of change
   
   The live-LLM e2e and cross-language CI tests run a small Ollama model 
(`qwen3:1.7b`) and fail intermittently — either the model returns a wrong 
tool-call result (e.g. `assert <varies> == 1386528`) or an Ollama call exceeds 
its read timeout (`httpx.ReadTimeout`). These flakes reproduce across many 
branches including `main`, turning CI red on unrelated PRs. See #716 for the 
failure statistics and evidence.
   
   This PR mitigates the flakiness without masking real, deterministic failures:
   
   1. **Per-test retry, scoped to the live-LLM e2e/cross-language suites 
only.** Python uses `pytest-rerunfailures` (`--reruns 2 --reruns-delay 5`); 
Java uses Surefire `-Dsurefire.rerunFailingTestsCount=2`. Both are applied only 
at the e2e/cross-language test invocations — the unit and style invocations are 
untouched, so a genuine regression still fails immediately. A test that passes 
on retry produces a green build but is reported as a flake (pytest `R` markers 
/ Surefire "Flakes"), so the signal is preserved, not hidden.
   
   2. **Close the one remaining 30 s timeout gap.** The cross-language test's 
Python-wrapped Ollama connection fell back to the Python default 
`request_timeout` of 30 s; this PR sets it to 240 s, matching the Java-native 
connection already configured in the same test agent.
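
   The scoping described in item 1 can be sketched as follows. This is an 
illustration only: the real CI scripts are shell, and `build_pytest_args` and 
the e2e selector `-k e2e_tests` are hypothetical names. The `--reruns` and 
`--reruns-delay` flags and the unit selector come from this PR.

```python
from typing import List

# Rerun flags from this PR; they are provided by the pytest-rerunfailures plugin.
RERUN_FLAGS = ["--reruns", "2", "--reruns-delay", "5"]


def build_pytest_args(suite: str) -> List[str]:
    """Return a pytest command line for the given suite (hypothetical helper)."""
    if suite == "unit":
        # Unit invocation is untouched: no reruns, so a regression fails at once.
        return ["pytest", "-k", "not e2e_tests"]
    if suite == "e2e":
        # Assumed e2e selector; only this invocation gets the rerun flags.
        return ["pytest", "-k", "e2e_tests", *RERUN_FLAGS]
    raise ValueError(f"unknown suite: {suite!r}")
```

Keeping the flags out of the unit invocation is what lets a deterministic 
failure surface on the first run.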
   
   Loosening exact-equality assertions on LLM output (asserting tool-call shape 
rather than an exact value) is noted in #716 as longer-term hardening and is 
intentionally out of scope here.
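
   For reference, a shape-based assertion of the kind mentioned above might 
look roughly like this. It is a sketch under assumptions, not code from this PR 
or from #716: the tool-call dictionary layout (`name`, `arguments`), the tool 
name `add`, and the helper `assert_tool_call_shape` are all hypothetical.

```python
from typing import Any, Dict, Mapping


def assert_tool_call_shape(
    call: Mapping[str, Any],
    expected_name: str,
    expected_arg_types: Dict[str, type],
) -> None:
    """Check which tool was called and with what argument types, not the values."""
    assert call.get("name") == expected_name, f"unexpected tool: {call.get('name')!r}"
    args = call.get("arguments")
    assert isinstance(args, dict), "arguments must be a dict"
    assert set(args) == set(expected_arg_types), f"unexpected argument keys: {sorted(args)}"
    for key, expected_type in expected_arg_types.items():
        assert isinstance(args[key], expected_type), f"{key} has wrong type"


# Hypothetical tool call as a model might emit it; the values are not checked.
example_call = {"name": "add", "arguments": {"a": 2123, "b": 2321}}
assert_tool_call_shape(example_call, "add", {"a": int, "b": int})
```

A check like this tolerates the model choosing different operand values while 
still failing if it calls the wrong tool or omits an argument.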
   
   ### Tests
   
   This is a CI/test-configuration change, so no new product tests are added (a 
test asserting "retry is configured" would be tautological). Verified:
   - The retry flags are scoped correctly: a deliberately failing test runs 
3 times (the initial attempt plus 2 reruns) under an e2e selector but only once 
under the unit selector; the unit pytest (`-k "not e2e_tests"`) and unit mvn 
(`-pl "${exclude_list}"`) invocations are unchanged.
   - `pytest-rerunfailures==16.3` resolves against `pytest==9.0.3`, and the e2e 
jobs install it via `tools/build.sh`'s `uv sync --extra dev` (the `dev` extra 
includes the `test` extra) before the `--no-sync` pytest runs.
   - `mvn ... -Dsurefire.rerunFailingTestsCount=2 test-compile` is accepted on 
Surefire 3.5.2; the cross-language module compiles with the timeout change.
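
   The attempt counts in the first check can be reasoned about with a small 
model of rerun behaviour. This is a simplified sketch, not the implementation 
of `pytest-rerunfailures` or Surefire; `run_with_reruns` and the outcome labels 
are hypothetical.

```python
from typing import Callable, Tuple


def run_with_reruns(test: Callable[[], None], reruns: int) -> Tuple[str, int]:
    """Run `test` up to `reruns + 1` times; return (outcome, attempts made)."""
    attempts = 0
    for _ in range(reruns + 1):
        attempts += 1
        try:
            test()
        except AssertionError:
            continue  # failed attempt: try again if any reruns remain
        # Passing on the first attempt is a clean pass; passing later is a flake.
        return ("passed" if attempts == 1 else "flaky", attempts)
    return ("failed", attempts)


def always_fails() -> None:
    assert False, "deliberate failure"


e2e_result = run_with_reruns(always_fails, reruns=2)   # ("failed", 3)
unit_result = run_with_reruns(always_fails, reruns=0)  # ("failed", 1)
```

A deterministic failure still ends as `failed` after all attempts, which is why 
the retry does not mask genuine regressions.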
   
   ### API
   
   No public API change. All changes are test-configuration (CI scripts, the 
Python test extra, and one e2e test agent).
   
   ### Documentation
   
   - [ ] `doc-needed`
   - [x] `doc-not-needed`
   - [ ] `doc-included`
   

