flyrain commented on issue #50: URL: https://github.com/apache/polaris-tools/issues/50#issuecomment-3694484537
> diff a lot based on the which llm model being used

I observed the same thing in my experiments. In my testing, Claude Opus 4.5 consistently produced the best overall results. The gap is significant enough that I would not recommend using other models at all. Maybe we can improve the context to make Claude Sonnet 4.5 work as well, but I'm not quite sure.

That said, I am not aware of a cost-efficient way to evaluate it at scale. I also could not find a programmatic way to test Claude for free, which makes systematic benchmarking challenging.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
