Recent math benchmarks for large language models (LLMs), such as MathArena, indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, O3-MINI, scoring comparably to top human competitors.

However, these benchmarks evaluate models solely on their final numerical answers, neglecting the rigorous reasoning and proof generation that are essential for real-world mathematical tasks.

To address this, we introduce the first comprehensive evaluation of full-solution reasoning on challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems of the 2025 USAMO within hours of their release.

Our results reveal that all tested models struggled significantly, scoring less than 5% of the maximum on average.

<https://arxiv.org/pdf/2503.21934v1>
