rosemarYuan commented on code in PR #713:
URL: https://github.com/apache/flink-agents/pull/713#discussion_r3328416464
##########
python/flink_agents/e2e_tests/e2e_tests_resource_cross_language/chat_model_cross_language_test.py:
##########
@@ -106,5 +106,6 @@ def test_java_chat_model_integration(
with file.open() as f:
actual_result.extend(f.readlines())
- assert "3" in actual_result[0]
- assert "cat" in actual_result[1]
+ joined = "\n".join(actual_result).lower()
+ assert "3" in joined, f"math answer missing '3': {actual_result!r}"
Review Comment:
Thanks for flagging this — the concern is valid. A stricter math assertion
would re-introduce flakiness, and those failures would be model-capability
noise rather than actual cross-language regressions. So I think accepting the
weaker math signal is a reasonable trade-off for this hotfix.
The way I see it, these are two different problems:
**(1) E2E cross-language behavioral consistency** — the primary goal of this
test. The order-insensitive join + lowercased check addresses this, and the
current approach prioritizes it.
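To make (1) concrete, here is a minimal sketch of the order-insensitive, case-insensitive check. The helper name `missing_tokens` and the sample output lines are hypothetical, chosen only for illustration; the expected tokens `"3"` and `"cat"` come from the original assertions.

```python
def missing_tokens(lines, expected):
    """Return the expected tokens found nowhere in the output, ignoring order and case."""
    joined = "\n".join(lines).lower()
    return [token for token in expected if token.lower() not in joined]


# Hypothetical output in which the model answered in the opposite order.
actual_result = ["The animal is a Cat.\n", "1 + 2 = 3\n"]
missing = missing_tokens(actual_result, ["3", "cat"])
assert not missing, f"missing {missing}: {actual_result!r}"
```

Because the lines are joined before matching, the check no longer depends on which line each answer lands on, which is the part that matters for the cross-language path.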
**(2) Model output quality validation** — a harder problem that a 1.7b model
on unstable CI hardware is fundamentally ill-suited for. If we want to
strengthen this later, possible directions include:
- Upgrading the CI model to one with more reliable arithmetic capability;
- Structuring and formalizing the prompt (e.g., explicit chain-of-thought
with strict output formatting);
- Adding a post-inference step that checks whether the model output
meets the prompt's expectations before the assertion runs.
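As a rough sketch of how the last two directions could fit together, assume the prompt asks the model to end its reply with a line of the form `ANSWER: <value>`. That convention, and the names `extract_answer` and `verify_answer`, are assumptions for illustration and not part of the current test.

```python
import re

# Hypothetical convention: the prompt asks the model to finish with "ANSWER: <value>".
ANSWER_PATTERN = re.compile(
    r"^[ \t]*ANSWER:[ \t]*(\S.*?)[ \t]*$", re.IGNORECASE | re.MULTILINE
)


def extract_answer(output):
    """Return the last well-formed answer in the output, or None if the format was ignored."""
    matches = ANSWER_PATTERN.findall(output)
    return matches[-1].lower() if matches else None


def verify_answer(output, expected):
    """Separate 'model ignored the format' from 'model gave the wrong answer'."""
    answer = extract_answer(output)
    if answer is None:
        return "malformed"
    return "ok" if answer == expected.lower() else "wrong"
```

A test could then skip or retry on `"malformed"` and fail only on `"wrong"`, so formatting noise from the small model would not be reported as a regression.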
These improvements to (2) are out of scope for this hotfix. Would love to
hear your thoughts on whether this trade-off works for now, or if you'd prefer
a different approach.