weiqingy commented on code in PR #713:
URL: https://github.com/apache/flink-agents/pull/713#discussion_r3329395823


##########
python/flink_agents/e2e_tests/e2e_tests_resource_cross_language/chat_model_cross_language_test.py:
##########
@@ -106,5 +106,6 @@ def test_java_chat_model_integration(
             with file.open() as f:
                 actual_result.extend(f.readlines())
 
-    assert "3" in actual_result[0]
-    assert "cat" in actual_result[1]
+    joined = "\n".join(actual_result).lower()
+    assert "3" in joined, f"math answer missing '3': {actual_result!r}"

Review Comment:
   Agreed. For a hotfix to the cross-language plumbing, the weaker math-token
check is the right call: a stricter text check against a 1.7b CI model would
just swap ordering flakiness for capability flakiness, and the
`"math answer missing '3'"` message already documents the limitation inline.
One out-of-scope idea for later: if the harness can surface it, asserting that
the `add` tool was actually invoked would be an order- and
capability-independent signal. Looks good as is. Thanks for the thorough
write-up.
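
A rough sketch of what that follow-up assertion could look like, assuming the harness can hand the test a list of tool-call records. `ToolCall` and `assert_tool_invoked` are hypothetical names for illustration only; nothing here reflects an existing flink-agents API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation surfaced by the harness."""

    name: str
    arguments: dict = field(default_factory=dict)


def assert_tool_invoked(tool_calls, tool_name):
    """Fail unless at least one recorded call targeted `tool_name`.

    The check ignores both the order of the output lines and the wording
    of the model's final answer.
    """
    invoked = [call.name for call in tool_calls]
    assert tool_name in invoked, (
        f"expected tool {tool_name!r} to be invoked, saw: {invoked!r}"
    )


# Made-up records standing in for whatever the harness would collect.
recorded = [ToolCall(name="add", arguments={"a": 1, "b": 2})]
assert_tool_invoked(recorded, "add")
```

Whether the cross-language harness can expose such records at all is the open question; if it can, this check would complement the text assertion rather than replace it.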



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]