Researchers Say the Most Popular Tool for Grading AIs Unfairly Favors Meta, 
Google, OpenAI

The most popular method for ranking the world’s best chatbots is flawed and 
frequently manipulated by powerful companies like OpenAI and Google to make 
their products seem better than they actually are, according to a new paper 
from researchers at the AI company Cohere, as well as Stanford, MIT, and other 
universities.

The researchers came to this conclusion after reviewing data made public by 
Chatbot Arena (also known as LMArena and LMSYS), which facilitates 
benchmarking and maintains the leaderboard listing the best large language 
models, as well as by scraping Chatbot Arena and running their own tests. 
Chatbot Arena, meanwhile, has responded to the researchers’ findings by saying 
that while it accepts some of the criticisms and plans to address them, some 
of the numbers the researchers present are wrong and mischaracterize how 
Chatbot Arena actually ranks LLMs. The research was published just weeks after 
Meta was accused of gaming AI benchmarks with one of its recent models. 

If you’re wondering why this beef between the researchers, Chatbot Arena, and 
others in the AI industry matters at all, consider that the biggest tech 
companies in the world, along with a great number of lesser-known startups, 
are locked in fierce competition to develop the most advanced AI tools, 
operating under the belief that these tools will define the future of humanity 
and enrich the industry’s winners in a way that will make previous technology 
booms seem minor by comparison. 

I should note here that Cohere is an AI company that produces its own models, 
and they don’t appear to rank very highly on the Chatbot Arena leaderboard. 
The researchers also argue that proprietary closed models from competing 
companies appear to have an unfair advantage over open-source models, and 
Cohere proudly boasts that its model Aya is “one of the largest open science 
efforts in ML to date.” In other words, the research is coming from a company 
that Chatbot Arena doesn’t benefit. 

Judging which large language model is the best is tricky because different 
people use different AI models for different purposes, and what counts as the 
“best” result is often subjective, but the desire to compete and compare has 
made the AI industry default to benchmarking. Specifically, it has defaulted 
to Chatbot Arena, which gives a numerical “Arena Score” to models companies 
submit and maintains a leaderboard listing the highest-scoring models. At the 
moment, for example, Google’s Gemini 2.5 Pro is in the number one spot, 
followed by OpenAI’s o3 and GPT-4o, and xAI’s Grok 3. 

The vast majority of people who use these tools probably have no idea the 
Chatbot Arena leaderboard exists, but it is a big deal to AI enthusiasts, 
CEOs, investors, researchers, and anyone who actively works in or is invested 
in the AI industry. The leaderboard remains significant despite having been 
criticized extensively over time for reasons like those above. The stakes of 
the AI race are objectively very high, in terms of both the money being poured 
into the space and the time and energy people are spending trying to win it, 
and Chatbot Arena, while flawed, is one of the few places keeping score. 

“A meaningful benchmark demonstrates the relative merits of new research ideas 
over existing ones, and thereby heavily influences research directions, funding 
decisions, and, ultimately, the shape of progress in our field,” the 
researchers write in their paper, titled “The Leaderboard Illusion.” “The 
recent meteoric rise of generative AI models—in terms of public attention, 
commercial adoption, and the scale of compute and funding involved—has 
substantially increased the stakes and pressure placed on leaderboards.”

The way that Chatbot Arena works is that anyone can go to its site and type in 
a prompt or question. That prompt is then given to two anonymous models. The 
user can’t see what the models are, but in theory one model could be ChatGPT 
while the other is Anthropic’s Claude. The user is then presented with the 
output from each of these models and votes for the one they think did a better 
job. Multiply this process by millions of votes and that’s how Chatbot Arena 
determines who is placed where on the leaderboards. Deepseek, the Chinese AI 
model that rocked the industry when it was released in January, is currently 
ranked #7 on the leaderboard, and its high score was part of the reason people 
were so impressed. 
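Chatbot Arena has described turning these pairwise votes into scores with an 
Elo-style rating system (it later moved to a related Bradley-Terry model). As 
a rough sketch of how a single “battle” moves two ratings, here is a minimal 
Elo update; the K-factor and starting rating of 1000 are illustrative choices, 
not Chatbot Arena’s actual parameters:

```python
def expected(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo formula."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both ratings after one head-to-head vote.

    The winner gains and the loser drops by the same amount, so
    the update is zero-sum; an upset win moves ratings more than
    an expected win.
    """
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1.0 - s_a) - (1.0 - e_a))

# Hypothetical example: two models start level, and model_a wins
# three of four votes, pulling ahead on the "leaderboard."
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_won in [True, True, False, True]:
    ratings["model_a"], ratings["model_b"] = update(
        ratings["model_a"], ratings["model_b"], a_won
    )
```

Scaled up to millions of votes across hundreds of models, this is the basic 
mechanism that produces a ranking from nothing but anonymous A/B preferences.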

According to the researchers’ paper, the biggest problem with this method is 
that Chatbot Arena allows the biggest companies in the space, namely Google, 
Meta, Amazon, and OpenAI, to run “undisclosed private testing” and cherry-pick 
their best model. The researchers said their systematic review of Chatbot 
Arena involved combining data sources encompassing 2 million “battles” and 
auditing 42 providers and 243 models between January 2024 and April 2025. 

“This comprehensive analysis reveals that over an extended period, a handful of 
preferred providers have been granted disproportionate access to data and 
testing,” the researchers wrote. “In particular, we identify an undisclosed 
Chatbot Arena policy that allows a small group of preferred model providers to 
test many model variants in private before releasing only the best-performing 
checkpoint.”

Basically, the researchers claim that companies test multiple variants of 
their LLMs on Chatbot Arena to find which score best, without those tests 
counting toward their public score. Then they submit only the best-scoring 
variant for official ranking.
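The inflation the researchers describe amounts to selection bias: if a 
provider privately tests many variants whose measured scores are noisy 
estimates of roughly the same underlying quality, publishing only the maximum 
raises the expected published score without any real improvement. A rough 
simulation of that effect, with entirely illustrative numbers (the true skill, 
noise level, and variant counts are not taken from the paper):

```python
import random

random.seed(0)

TRUE_SKILL = 1200.0  # hypothetical "true" Arena score of every variant
NOISE_SD = 15.0      # hypothetical measurement noise in one private test

def best_of(n: int, trials: int = 20000) -> float:
    """Average published score when a provider privately tests n
    equal-quality variants and reports only the best result."""
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(TRUE_SKILL, NOISE_SD) for _ in range(n))
    return total / trials

# Submitting one model reports roughly the true skill; testing many
# variants and keeping the max inflates the published number.
print(f"1 variant tested:   {best_of(1):.1f}")
print(f"10 variants tested: {best_of(10):.1f}")
```

Under these assumptions the ten-variant provider’s published score lands 
noticeably above its true skill, even though every variant is identical in 
quality, which is the mechanism behind the paper’s objection to undisclosed 
private testing.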

Chatbot Arena says the researchers’ framing here is misleading. 

“We designed our policy to prevent model providers from just reporting the 
highest score they received during testing. We only publish the score for the 
model they release publicly,” it said on X. 

“In a single month, we observe as many as 27 models from Meta being tested 
privately on Chatbot Arena in the lead up to Llama 4 release,” the researchers 
said. “Notably, we find that Chatbot Arena does not require all submitted 
models to be made public, and there is no guarantee that the version appearing 
on the public leaderboard matches the publicly available API.”

In early April, when Meta’s model Maverick shot up to the second spot on the 
leaderboard, users were confused because they didn’t find it as good as its 
ranking suggested, or better than models ranked below it. As TechCrunch noted 
at the time, that might be because Meta used a slightly different version of 
the model on Chatbot Arena, one “optimized for conversationality,” than the 
one users had access to.

“We helped Meta with pre-release testing for Llama 4, like we have helped many 
other model providers in the past,” Chatbot Arena said in response to the 
research paper. “We support open-source development. Our own platform and 
analysis tools are open source, and we have released millions of open 
conversations as well. This benefits the whole community.”

The researchers also claim that makers of proprietary models, like OpenAI and 
Google, collect far more data from their testing on Chatbot Arena than fully 
open-source models do, which allows them to better fine-tune their models to 
what Chatbot Arena users want. 

That last part on its own might be the biggest problem with Chatbot Arena’s 
leaderboard in the long term, since it incentivizes the people who create AI 
models to design them to score well on Chatbot Arena rather than to be 
materially better and safer for users in real-world environments. 

As the researchers write: “the over-reliance on a single leaderboard creates a 
risk that providers may overfit to the aspects of leaderboard performance, 
without genuinely advancing the technology in meaningful ways. As Goodhart’s 
Law states, when a measure becomes a target, it ceases to be a good measure.”

Despite their criticism, the researchers acknowledge Chatbot Arena’s 
contribution to AI research and that it serves a need, and their paper ends 
with a list of recommendations for making it better, including preventing 
companies from retracting scores after submission and being more transparent 
about which models engage in private testing and how much.

“One might disagree with human preferences—they’re subjective—but that’s 
exactly why they matter,” Chatbot Arena said on X in response to the paper. 
“Understanding subjective preference is essential to evaluating real-world 
performance, as these models are used by people. That’s why we’re working on 
statistical methods—like style and sentiment control—to decompose human 
preference into its constituent parts. We are also strengthening our user base 
to include more diversity. And if pre-release testing and data helps models 
optimize for millions of people’s preferences, that’s a positive thing!”

“If a model provider chooses to submit more tests than another model provider, 
this does not mean the second model provider is treated unfairly,” it added. 
“Every model provider makes different choices about how to use and value human 
preferences.”

<https://www.404media.co/chatbot-arena-illusion-paper-meta-openai/>
