rosemarYuan commented on code in PR #713:
URL: https://github.com/apache/flink-agents/pull/713#discussion_r3329752117
##########
python/flink_agents/e2e_tests/e2e_tests_resource_cross_language/chat_model_cross_language_test.py:
##########
@@ -106,5 +106,6 @@ def test_java_chat_model_integration(
with file.open() as f:
actual_result.extend(f.readlines())
- assert "3" in actual_result[0]
- assert "cat" in actual_result[1]
+ joined = "\n".join(actual_result).lower()
+ assert "3" in joined, f"math answer missing '3': {actual_result!r}"
Review Comment:
Thanks, agreed. Since the current harness only reads the file-sink text
output, keeping the weak `"3"` check is reasonable for this hotfix and avoids
turning it into another flaky test that depends on the 1.7b model's capability.
As a follow-up, I agree that surfacing tool-invocation events would be a
stronger signal. One nuance is that tool invocation and final-answer
correctness are separate dimensions. In previous runs we have seen several
different behaviors: the model may answer directly without tools, call the tool
with correct arguments, call the tool with hallucinated or wrong arguments, miss
a later calculation step, emit a tool call as plain text instead of an actual
tool call, get the correct tool result but still produce a wrong final answer,
or return a response that does not match the expected schema. In short, a
successful tool call does not guarantee a correct final answer.
So if the harness can expose tool events later, checking that `add` was
invoked would be a stronger signal for the tool-calling path than scanning the
text output alone. To make that check more meaningful, we may also want to
validate the tool arguments, e.g. that `add` was invoked with the expected
inputs, and keep final-output validation as a separate concern.
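
To make the split concrete, here is a rough sketch of what the two checks could
look like. Everything in it is an assumption for illustration: `ToolEvent`,
`assert_tool_called`, and `assert_final_answer` are hypothetical names, not
existing flink-agents APIs, and the argument names `a`/`b` and their values are
placeholders, since the harness does not expose tool events today.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolEvent:
    """Hypothetical record of one tool invocation (not a flink-agents type)."""

    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def assert_tool_called(
    events: list[ToolEvent], name: str, expected_args: dict[str, Any]
) -> None:
    """Fail unless `name` was invoked at least once with exactly `expected_args`."""
    calls = [e for e in events if e.name == name]
    assert calls, f"tool {name!r} was never invoked: {events!r}"
    assert any(e.arguments == expected_args for e in calls), (
        f"tool {name!r} invoked, but not with {expected_args!r}: {calls!r}"
    )


def assert_final_answer(lines: list[str], expected: str) -> None:
    """Separate concern: check the file-sink text, independent of tool events."""
    joined = "\n".join(lines).lower()
    assert expected in joined, f"final answer missing {expected!r}: {lines!r}"


# Placeholder data; the real prompt, arguments, and event shape may differ.
events = [ToolEvent("add", {"a": 1, "b": 2})]
actual_result = ["The answer is 3\n"]

assert_tool_called(events, "add", {"a": 1, "b": 2})  # tool-calling path
assert_final_answer(actual_result, "3")  # final-output correctness
```

Keeping the two helpers separate means a failure tells us which dimension
broke: a run where `add` was called correctly but the final text is wrong fails
only the second check, and a direct answer without tools fails only the first.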
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]