Recent math benchmarks for large language models (LLMs), such as MathArena, indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, O3-MINI, scoring comparably to top human competitors.

However, these benchmarks evaluate models solely on their final numerical answers, neglecting the rigorous reasoning and proof generation that are essential for real-world mathematical tasks.

To address this, we introduce the first comprehensive evaluation of full-solution reasoning on challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems of the 2025 USAMO within hours of their release.

Our results reveal that all tested models struggled significantly, scoring less than 5% of the maximum on average.

<https://arxiv.org/pdf/2503.21934v1>
