rosemarYuan commented on code in PR #713:
URL: https://github.com/apache/flink-agents/pull/713#discussion_r3329752117


##########
python/flink_agents/e2e_tests/e2e_tests_resource_cross_language/chat_model_cross_language_test.py:
##########
@@ -106,5 +106,6 @@ def test_java_chat_model_integration(
             with file.open() as f:
                 actual_result.extend(f.readlines())
 
-    assert "3" in actual_result[0]
-    assert "cat" in actual_result[1]
+    joined = "\n".join(actual_result).lower()
+    assert "3" in joined, f"math answer missing '3': {actual_result!r}"

Review Comment:
   Thanks, agreed. Since the current harness only reads the file-sink text 
output, keeping the weak `"3"` check is reasonable for this hotfix and avoids 
turning it into yet another flaky test that depends on what the 1.7b model can do.
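   
   To make "weak" concrete, here is a minimal sketch of what the joined-text 
check accepts. The helper name `weak_contains_check` and the sample lines are 
made up for illustration; only the join-and-lowercase step mirrors the diff.

```python
def weak_contains_check(lines: list[str], needle: str) -> bool:
    """Return True if `needle` appears anywhere in the joined, lowercased output."""
    joined = "\n".join(lines).lower()
    return needle in joined


# Order-insensitive: passes whichever line carries the math answer.
order_swapped = weak_contains_check(["A cat.\n", "The answer is 3.\n"], "3")

# Weakness: an unrelated "3" (here inside "13") also satisfies the check.
unrelated_digit = weak_contains_check(["I have 13 cats.\n"], "3")

print(order_swapped, unrelated_digit)
```

   So the check no longer depends on line order, but it cannot tell a correct 
answer from any other output that happens to contain the character `3`.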
   
   For a follow-up, I agree that surfacing tool-invocation events would be a 
stronger signal. One nuance is that tool invocation and final-answer 
correctness are separate dimensions. From previous runs, we have seen several 
different behaviors: the model may answer directly without tools, call the tool 
with correct arguments, call the tool with hallucinated/wrong arguments, miss a 
later calculation step, emit a tool call as plain text instead of an actual 
tool call, get the correct tool result but still produce a wrong final answer, 
or return a response that does not match the expected schema. In short, a 
successful tool call does not guarantee a correct final answer.
   
   So if the harness can expose tool events later, checking that `add` was 
invoked would be a stronger signal for the tool-calling path than scanning the 
text output alone. To make that check more meaningful, we may also want to 
validate the tool arguments, e.g. that `add` was invoked with the expected 
inputs, and keep final-output validation as a separate concern.
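   
   A rough sketch of how the two checks could stay separate, assuming the 
harness can someday hand the test a list of tool events. `ToolEvent`, 
`tool_called_with`, `final_answer_contains`, and the argument names `a`/`b` 
with values 1 and 2 are all hypothetical and not part of flink-agents today.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolEvent:
    """Hypothetical record of one tool invocation (not a flink-agents type)."""
    name: str
    arguments: dict[str, Any] = field(default_factory=dict)


def tool_called_with(events: list[ToolEvent], name: str,
                     expected_args: dict[str, Any]) -> bool:
    """Tool-calling dimension: was `name` invoked with exactly `expected_args`?"""
    return any(e.name == name and e.arguments == expected_args for e in events)


def final_answer_contains(lines: list[str], needle: str) -> bool:
    """Final-output dimension: the same weak text check used by the hotfix."""
    return needle in "\n".join(lines).lower()


# Assumed sample data: the tool was called correctly, but the answer is wrong.
events = [ToolEvent("add", {"a": 1, "b": 2})]
output = ["The result is 4.\n"]

tool_ok = tool_called_with(events, "add", {"a": 1, "b": 2})
answer_ok = final_answer_contains(output, "3")
print(tool_ok, answer_ok)
```

   Keeping the two results apart lets a failure report say which dimension 
broke, instead of folding both into a single text assertion.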



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
