weiqingy opened a new issue, #716: URL: https://github.com/apache/flink-agents/issues/716
### Search before asking

- [x] I searched in the [issues](https://github.com/apache/flink-agents/issues) and found nothing similar.

### Description

### Summary

Several end-to-end / cross-language CI tests that depend on a **live Ollama (`qwen3:1.7b`) server** fail intermittently, turning CI red on unrelated PRs and occasionally on `main` itself. The failures are not caused by the code under test: the same tests pass on re-run with no change. This issue proposes mitigating the flakiness so that the CI signal becomes trustworthy.

### Evidence (last ~35 `Flink Agents CI` runs)

Across the most recent ~35 workflow runs, **~18 ended in failure**. Separating *real* failures from *flaky* ones:

| Category | Jobs | Nature |
|---|---|---|
| **Flaky** (live-LLM e2e) | `cross-language`, `it-python` (e2e), `it-java` (e2e) | Non-deterministic; pass on re-run |
| Real (excluded here) | `ut-python` snapshot/unit, `Code Style Check` | Deterministic, branch-specific; *not* part of this issue |

The flaky failures reproduce **across many different branches, including `main`**, which confirms they come from shared test infrastructure rather than any single PR's code:

```
main                             → httpx.ReadTimeout (×2)
feature/cross_language_actions   → ReadTimeout / assert ≠ 1386528
fix/cross-language-test-order    → ReadTimeout / assert ≠ 1386528  (failed 6 of 7 runs)
docs/cross_language_actions      → httpx.ReadTimeout
new_short_memory_ttl             → httpx.ReadTimeout
220-impl                         → assert ≠ 1386528
```

### Two distinct flaky signatures

**1. LLM non-determinism: wrong tool-calling result**

`flink_agents/e2e_tests/e2e_tests_integration/react_agent_test.py::test_react_agent_on_local_runner`

The ReAct agent should compute `(2123 + 2321) * 312 = 1386528` via the `add`/`multiply` tools, but the 1.7B model returns a wrong value.
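
For reference, the expected tool-call chain can be written out in plain Python. Only the tool names `add` and `multiply` come from the test; the signatures below are assumptions for illustration, not the project's actual tool definitions.

```python
# Hypothetical stand-ins for the agent's tools; only the names come from the test.
def add(a: int, b: int) -> int:
    return a + b


def multiply(a: int, b: int) -> int:
    return a * b


# Expected chain: add first, then multiply the intermediate sum.
intermediate = add(2123, 2321)           # 4444
expected = multiply(intermediate, 312)   # 1386528
assert expected == 1386528
```
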

Observed across runs (all on the *same* assertion, all different garbage values):

```
assert 45750 == 1386528
assert 4927483 == 1386528
assert 432596736 == 1386528   (×2)
```

The test already carries an in-line admission of the flakiness:

> `"This may be caused by the LLM response does not match the output schema, you can rerun this case."`

**2. Ollama request timeout**

`ChatModelCrossLanguageTest.testChatModeIntegration` (and other e2e Ollama calls)

```
Caused by: pemja.core.PythonException: <class 'httpx.ReadTimeout'>: timed out
[ERROR] Tests run: 1, Failures: 0, Errors: 1 ... ChatModelCrossLanguageTest
```

The HTTP read to the local Ollama server exceeds the client timeout under CI load. Notably, the timeout is already partially addressed: `ChatModelCrossLanguageAgent.javaChatModelConnection()` sets `requestTimeout: 240`, but `pythonChatModelConnection()` does **not**, so it falls back to the Python default `DEFAULT_REQUEST_TIMEOUT = 30.0` (`ollama_chat_model.py`). The failing prompts route through the Python-wrapped connection, i.e. the 30 s path.

### Root cause

Both signatures stem from e2e/cross-language tests asserting **exact** outcomes from a **small, live LLM** under shared-runner load:

- a 1.7B model is non-deterministic and sometimes gets tool-call arithmetic wrong, and
- a cold `qwen3:1.7b` on a busy runner can exceed a 30 s read timeout.

### Proposed mitigation

1. **Per-test retry for the live-LLM e2e tests (primary).** Add `pytest-rerunfailures` (`--reruns 2 --reruns-delay …`) for Python and Surefire `rerunFailingTestsCount` for Java, **scoped to the e2e / cross-language suites only** (e.g. via a marker), so deterministic unit tests are *not* retry-eligible and real regressions still fail loudly. This covers both signatures and every affected test.
2. **Raise the timeout on the Python-wrapped connection (cheap complement).** Pass `request_timeout` to `pythonChatModelConnection()` mirroring the existing `requestTimeout: 240` on the Java connection.
   One line, with precedent in the same file. Removes the timeout trigger but does not help the wrong-answer case.
3. **Optional: loosen exact-equality assertions over time.** For LLM-output tests, assert on schema/shape or that the correct tool was invoked, rather than an exact numeric value. Reduces dependence on a small model's correctness.

Recommendation: **(1) is the load-bearing fix** (covers both failure modes across all affected tests); (2) is a cheap, well-scoped follow-on; (3) is a longer-term hardening direction.

### Are you willing to submit a PR?

- [x] I'm willing to submit a PR!
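
As a rough illustration of the retry semantics in mitigation (1), the sketch below uses a plain-Python decorator rather than `pytest-rerunfailures` itself. The `rerun` decorator, its parameters, and the two case functions are hypothetical names for illustration only; the actual change would rely on the plugin's own options for Python and Surefire's `rerunFailingTestsCount` for Java.

```python
import functools
import time


def rerun(reruns: int = 2, delay: float = 0.0):
    """Re-invoke a test function up to `reruns` extra times on failure.

    Hypothetical helper: it mimics the rerun idea only and is not the
    pytest-rerunfailures API.
    """
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            for attempt in range(reruns + 1):
                try:
                    return test_fn(*args, **kwargs)
                except (AssertionError, TimeoutError):
                    # Out of attempts: surface the real failure.
                    if attempt == reruns:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator


# Simulated live-LLM outcomes: one wrong answer, then the correct one.
_answers = iter([45750, 1386528])
attempts = []


@rerun(reruns=2)
def react_agent_case():
    result = next(_answers)
    attempts.append(result)
    assert result == 1386528


# A deterministic case is left undecorated, so a regression fails on the first try.
def unit_case():
    assert (2123 + 2321) * 312 == 1386528


react_agent_case()
unit_case()
```

The opt-in shape is the point here: only the decorated live-LLM case is retried, while the undecorated deterministic case still fails immediately. With `pytest-rerunfailures`, the equivalent per-test opt-in is its `flaky` marker with `reruns` and `reruns_delay` arguments.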
