weiqingy opened a new pull request, #717:
URL: https://github.com/apache/flink-agents/pull/717
Linked issue: #716
### Purpose of change
The live-LLM e2e and cross-language CI tests run a small Ollama model
(`qwen3:1.7b`) and fail intermittently — either the model returns a wrong
tool-call result (e.g. `assert <varies> == 1386528`) or an Ollama call exceeds
its read timeout (`httpx.ReadTimeout`). These flakes reproduce across many
branches including `main`, turning CI red on unrelated PRs. See #716 for the
failure statistics and evidence.
This PR mitigates the flakiness without masking real, deterministic failures:
1. **Per-test retry, scoped to the live-LLM e2e/cross-language suites
only.** Python uses `pytest-rerunfailures` (`--reruns 2 --reruns-delay 5`);
Java uses Surefire `-Dsurefire.rerunFailingTestsCount=2`. Both are applied only
at the e2e/cross-language test invocations — the unit and style invocations are
untouched, so a genuine regression still fails immediately. A test that passes
on retry produces a green build but is reported as a flake (pytest `R` markers
/ Surefire "Flakes"), so the signal is preserved, not hidden.
2. **Close the one remaining 30 s timeout gap.** The cross-language test's
Python-wrapped Ollama connection fell back to the Python default
`request_timeout` of 30 s; this sets it to 240 s, matching the Java-native
connection already configured in the same test agent.
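
The retry semantics described in item 1 can be sketched in plain Python. This is a simplified model and not how `pytest-rerunfailures` or Surefire are implemented: `run_with_reruns`, `Outcome`, and `make_flaky` are hypothetical names introduced for illustration, and the rerun delay is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    status: str    # "passed", "flaky", or "failed"
    attempts: int  # how many times the test body actually ran


def run_with_reruns(test: Callable[[], None], reruns: int = 2) -> Outcome:
    """Run `test` once, then up to `reruns` more times, stopping at the first pass."""
    for attempt in range(1, reruns + 2):
        try:
            test()
        except Exception:
            continue  # this attempt failed; retry if any attempts remain
        # A pass after at least one failure is green but still reported as a flake.
        return Outcome("passed" if attempt == 1 else "flaky", attempt)
    return Outcome("failed", reruns + 1)


def make_flaky(failures_before_pass: int) -> Callable[[], None]:
    """Hypothetical test body that fails a fixed number of times, then passes."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        if calls["n"] <= failures_before_pass:
            raise AssertionError("simulated wrong tool-call result")

    return test


print(run_with_reruns(make_flaky(1)))            # one failure, then a pass
print(run_with_reruns(make_flaky(5)))            # never passes within 3 attempts
print(run_with_reruns(make_flaky(5), reruns=0))  # unit-style run: no retry
```

With `reruns=0`, which models the unit and style invocations, a failing test fails on its first attempt, so a deterministic regression in those suites is not retried.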
Loosening exact-equality assertions on LLM output (asserting tool-call shape
rather than an exact value) is noted in #716 as longer-term hardening and is
intentionally out of scope here.
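
For context, asserting tool-call shape rather than an exact value might look roughly like the sketch below. It is not part of this PR, and everything in it is an assumption: the `name`/`arguments` dict layout, the `add` tool, and the helper `assert_tool_call_shape` are hypothetical and do not reflect the project's actual test code.

```python
from typing import Any


def assert_tool_call_shape(
    call: dict[str, Any],
    expected_name: str,
    arg_types: dict[str, type],
) -> None:
    """Check that a tool call names the right tool and passes well-typed arguments.

    The dict layout ({"name": ..., "arguments": {...}}) is assumed for illustration.
    """
    assert call.get("name") == expected_name, f"unexpected tool: {call.get('name')!r}"
    args = call.get("arguments")
    assert isinstance(args, dict), "arguments must be a mapping"
    for key, expected_type in arg_types.items():
        assert key in args, f"missing argument: {key}"
        assert isinstance(args[key], expected_type), f"wrong type for {key}"


# Passes regardless of which numbers the model chose, as long as the shape is right.
assert_tool_call_shape(
    {"name": "add", "arguments": {"a": 1, "b": 2}},
    expected_name="add",
    arg_types={"a": int, "b": int},
)
```

A check like this tolerates variation in the model's numeric output while still failing when the wrong tool is called or an argument is missing.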
### Tests
This is a CI/test-configuration change, so no new product tests are added (a
test asserting "retry is configured" would be tautological). Verified:
- The retry flags are scoped correctly: a deliberately failing test runs 3×
(the initial attempt plus two reruns) under an e2e selector but only once
under the unit selector; the unit pytest (`-k "not e2e_tests"`) and unit mvn
(`-pl "${exclude_list}"`) invocations are unchanged.
- `pytest-rerunfailures==16.3` resolves against `pytest==9.0.3`, and the e2e
jobs install it via `tools/build.sh`'s `uv sync --extra dev` (the `dev` extra
composes `test`) before the `--no-sync` pytest runs.
- `mvn ... -Dsurefire.rerunFailingTestsCount=2 test-compile` is accepted on
Surefire 3.5.2; the cross-language module compiles with the timeout change.
### API
No public API change. All changes are test-configuration (CI scripts, the
Python test extra, and one e2e test agent).
### Documentation
- [ ] `doc-needed`
- [x] `doc-not-needed`
- [ ] `doc-included`