[I] [Tech Debt] Flaky live-LLM e2e/cross-language CI tests cause frequent red CI [flink-agents]

via GitHub Sat, 30 May 2026 19:05:36 -0700


weiqingy opened a new issue, #716:
URL: https://github.com/apache/flink-agents/issues/716


   ### Search before asking
   
   - [x] I searched in the [issues](https://github.com/apache/flink-agents/issues) and found nothing similar.
   
   ### Description
   
   ### Summary
   
   Several end-to-end / cross-language CI tests that depend on a **live Ollama 
(`qwen3:1.7b`) server** fail intermittently, turning CI red on unrelated PRs 
and occasionally on `main` itself. The failures are not caused by the code 
under test — the same tests pass on re-run with no change. This issue proposes 
mitigating the flakiness so CI signal becomes trustworthy.
   
   ### Evidence (last ~35 `Flink Agents CI` runs)
   
   Across the most recent ~35 workflow runs, **~18 ended in failure**. 
Separating *real* failures from *flaky* ones:
   
   | Category | Jobs | Nature |
   |---|---|---|
   | **Flaky** (live-LLM e2e) | `cross-language`, `it-python` (e2e), `it-java` (e2e) | Non-deterministic; pass on re-run |
   | Real (excluded here) | `ut-python` snapshot/unit, `Code Style Check` | Deterministic, branch-specific; *not* part of this issue |
   
   The flaky failures reproduce **across many different branches, including 
`main`**, which confirms they come from shared test infrastructure rather than 
any single PR's code:
   
   ```
   main                          → httpx.ReadTimeout (×2)
   feature/cross_language_actions → ReadTimeout / assert ≠ 1386528
   fix/cross-language-test-order  → ReadTimeout / assert ≠ 1386528 (failed 6 of 7 runs)
   docs/cross_language_actions    → httpx.ReadTimeout
   new_short_memory_ttl           → httpx.ReadTimeout
   220-impl                       → assert ≠ 1386528
   ```
   
   ### Two distinct flaky signatures
   
   **1. LLM non-determinism — wrong tool-calling result**
   
`flink_agents/e2e_tests/e2e_tests_integration/react_agent_test.py::test_react_agent_on_local_runner`
   
   The ReAct agent should compute `(2123 + 2321) * 312 = 1386528` via the 
`add`/`multiply` tools, but the 1.7B model sometimes returns a wrong value. 
Observed failures across runs (all on the *same* assertion, with inconsistent 
wrong values):
   
   ```
   assert 45750    == 1386528
   assert 4927483  == 1386528
   assert 432596736 == 1386528   (×2)
   ```
   
   The test already carries an in-line admission of the flakiness:
   > `"This may be caused by the LLM response does not match the output schema, 
you can rerun this case."`
   
   **2. Ollama request timeout**
   `ChatModelCrossLanguageTest.testChatModeIntegration` (and other e2e Ollama 
calls)
   
   ```
   Caused by: pemja.core.PythonException: <class 'httpx.ReadTimeout'>: timed out
   [ERROR] Tests run: 1, Failures: 0, Errors: 1 ... ChatModelCrossLanguageTest
   ```
   
   The HTTP read to the local Ollama server exceeds the client timeout under CI 
load. Notably the timeout is already partially addressed: 
`ChatModelCrossLanguageAgent.javaChatModelConnection()` sets `requestTimeout: 
240`, but `pythonChatModelConnection()` does **not**, so it falls back to the 
Python default `DEFAULT_REQUEST_TIMEOUT = 30.0` (`ollama_chat_model.py`). The 
failing prompts route through the Python-wrapped connection, i.e. the 30 s path.
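
   To make the fallback concrete, here is a minimal Python sketch of how an unset timeout resolves to the default. Only `DEFAULT_REQUEST_TIMEOUT = 30.0` and the value `240` come from this issue; the helper `effective_timeout`, the `request_timeout` dictionary key, and the dictionary-shaped connection arguments are hypothetical and do not reflect the real `ollama_chat_model.py` code.

```python
# Value quoted in this issue from ollama_chat_model.py.
DEFAULT_REQUEST_TIMEOUT = 30.0


def effective_timeout(connection_args: dict) -> float:
    """Hypothetical helper: the read timeout (seconds) a connection ends up with.

    An explicit `request_timeout` wins; otherwise the default applies.
    """
    return float(connection_args.get("request_timeout", DEFAULT_REQUEST_TIMEOUT))


# Assumed shapes: one connection sets the timeout explicitly (like the Java
# connection's 240), the other omits it (like pythonChatModelConnection()).
java_like_args = {"request_timeout": 240}
python_like_args = {}

print(effective_timeout(java_like_args))    # 240.0
print(effective_timeout(python_like_args))  # 30.0
```

Under these assumptions, any response that takes longer than 30 seconds fails on the second connection while the same response succeeds on the first.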
   
   ### Root cause
   
   Both signatures stem from e2e/cross-language tests asserting **exact** 
outcomes from a **small, live LLM** under shared-runner load:
   - a 1.7B model is non-deterministic and sometimes gets tool-call arithmetic 
wrong, and
   - a cold `qwen3:1.7b` on a busy runner can exceed a 30 s read timeout.
   
   ### Proposed mitigation
   
   1. **Per-test retry for the live-LLM e2e tests (primary).** Add 
`pytest-rerunfailures` (`--reruns 2 --reruns-delay …`) for Python and Surefire 
`rerunFailingTestsCount` for Java, **scoped to the e2e / cross-language suites 
only** (e.g. via a marker), so deterministic unit tests are *not* 
retry-eligible and real regressions still fail loudly. This covers both 
signatures and every affected test.
   
   2. **Raise the timeout on the Python-wrapped connection (cheap 
complement).** Pass `request_timeout` to `pythonChatModelConnection()`, 
mirroring the existing `requestTimeout: 240` on the Java connection. This is a 
one-line change with precedent in the same file. It removes the timeout 
trigger but does not help the wrong-answer case.
   
   3. **Optional: loosen exact-equality assertions over time.** For LLM-output 
tests, assert on schema/shape or that the correct tool was invoked, rather than 
on an exact numeric value. This reduces dependence on a small model's correctness.
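
   The scoping described in (1) can be illustrated with a plain-Python stand-in. This is not the `pytest-rerunfailures` or Surefire API; the `live_llm` tag and `run_with_scoped_retry` are hypothetical names used only to show the intended semantics: tagged tests get extra attempts, untagged tests fail on their first attempt.

```python
def live_llm(test_fn):
    """Hypothetical tag: marks a test as depending on a live LLM server."""
    test_fn.live_llm = True
    return test_fn


def run_with_scoped_retry(test_fn, reruns=2):
    """Run test_fn, allowing `reruns` extra attempts only for tagged tests.

    Returns the number of attempts used. If every allowed attempt fails,
    the last failure is re-raised so the run still goes red.
    """
    is_live = getattr(test_fn, "live_llm", False)
    attempts_allowed = 1 + (reruns if is_live else 0)
    for attempt in range(1, attempts_allowed + 1):
        try:
            test_fn()
            return attempt
        except Exception:
            if attempt == attempts_allowed:
                raise


# Simulated tests (stand-ins, not the real suites).
_calls = {"n": 0}


@live_llm
def flaky_e2e_test():
    # Fails on the first attempt, passes on the second.
    _calls["n"] += 1
    assert _calls["n"] >= 2, "simulated wrong LLM answer"


def broken_unit_test():
    # Deterministic failure: must never be retried.
    assert 1 + 1 == 3
```

With these stand-ins, `run_with_scoped_retry(flaky_e2e_test)` succeeds on its second attempt, while `run_with_scoped_retry(broken_unit_test)` raises immediately, which is the behaviour the marker-based scoping is meant to guarantee.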
   
   Recommendation: **(1) is the load-bearing fix** (covers both failure modes 
across all affected tests); (2) is a cheap, well-scoped follow-on; (3) is a 
longer-term hardening direction.
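
   As an illustration of direction (3), the sketch below checks which tools were invoked and that each recorded tool result is consistent with its arguments, instead of comparing the model's final number to `1386528`. The trace format (a list of dicts with `tool`, `args`, `result`) and the helper `assert_tool_usage` are assumptions for illustration, not the framework's actual test API.

```python
# Deterministic reference implementations of the two tools named in the test.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
}


def assert_tool_usage(trace, required=("add", "multiply")):
    """Assert on tool-calling shape rather than on the final LLM answer.

    `trace` is a hypothetical list of recorded calls, each a dict with
    keys "tool", "args", and "result".
    """
    called = [step["tool"] for step in trace]
    missing = [name for name in required if name not in called]
    assert not missing, f"tools never invoked: {missing}"

    # Each recorded result must match what the tool really computes,
    # which validates the plumbing without trusting the model's arithmetic plan.
    for step in trace:
        expected = TOOLS[step["tool"]](*step["args"])
        assert step["result"] == expected, (
            f"{step['tool']}{step['args']} recorded {step['result']}, expected {expected}"
        )


# Example trace for (2123 + 2321) * 312.
example_trace = [
    {"tool": "add", "args": (2123, 2321), "result": 4444},
    {"tool": "multiply", "args": (4444, 312), "result": 1386528},
]
assert_tool_usage(example_trace)
```

The trade-off is deliberate: a run in which the model picks the wrong operands would still pass this check, so it verifies the agent/tool integration rather than the small model's reasoning.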
   
   
   
   ### Are you willing to submit a PR?
   
   - [x] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

