flyrain commented on issue #50:
URL: https://github.com/apache/polaris-tools/issues/50#issuecomment-3694484537

   > diff a lot based on the which llm model being used
   
   I observed the same thing in my experiments. In my testing, Claude Opus 4.5 
consistently produced the best overall results. The gap is significant enough 
that I would not recommend using other models at all. Maybe we can improve the 
context to make Claude Sonnet 4.5 work as well, but I'm not quite sure.
   
   That said, I am not aware of a cost-efficient way to evaluate the models at 
scale. I also could not find a programmatic way to test Claude for free, which 
makes systematic benchmarking challenging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]