[nexa] When billion-dollar AIs break down over puzzles a child can do, it’s time to rethink the hype

Daniela Tafani Tue, 10 Jun 2025 14:57:11 -0700

When billion-dollar AIs break down over puzzles a child can do, it’s time to 
rethink the hype
Gary Marcus

The tech world is reeling from a paper that shows the powers of a new
generation of AI have been wildly oversold

A research paper by Apple has taken the tech world by storm, all but
eviscerating the popular notion that large language models (LLMs, and their
newest variant, LRMs, large reasoning models) are able to reason reliably. Some
are shocked by it, some are not. The well-known venture capitalist Josh Wolfe
went so far as to post on X that “Apple [had] just GaryMarcus’d LLM reasoning
ability” – coining a new verb (and a compliment to me), referring to “the act
of critically exposing or debunking the overhyped capabilities of artificial
intelligence … by highlighting their limitations in reasoning, understanding,
or general intelligence”.

Apple did this by showing that leading models such as ChatGPT, Claude and
DeepSeek may “look smart – but when complexity rises, they collapse”. In short,
these models are very good at a kind of pattern recognition, but often fail
when they encounter novelty that forces them beyond the limits of their
training, despite being, as the paper notes, “explicitly designed for reasoning
tasks”.

As discussed later, there is a loose end that the paper doesn’t tie up, but on
the whole, its force is undeniable. So much so that LLM advocates are already
partly conceding the blow while hinting at, or at least hoping for, happier
futures ahead.

In many ways the paper echoes and amplifies an argument that I have been making
since 1998: neural networks of various kinds can generalise within a
distribution of data they are exposed to, but their generalisations tend to
break down beyond that distribution. A simple example of this is that I once
trained an older model to solve a very basic mathematical equation using only
even-numbered training data. The model was able to generalise a little bit: it
could solve for even numbers it hadn’t seen before, but not for problems
where the answer was an odd number.

More than a quarter of a century later, when a task is close to the training
data, these systems work pretty well. But as they stray further away from that
data, they often break down, as they did in the Apple paper’s more stringent
tests. Such limits arguably remain the single most serious weakness
in LLMs.

The hope, as always, has been that “scaling” the models, by making them bigger,
would solve these problems. The new Apple paper resoundingly rebuts these
hopes. They challenged some of the latest, greatest, most expensive models with
classic puzzles, such as the Tower of Hanoi – and found that deep problems
lingered. Combined with numerous hugely expensive failures in efforts to build
GPT-5 level systems, this is very bad news.

The Tower of Hanoi is a classic game with three pegs and multiple discs, in
which you need to move all the discs from the left peg to the right peg, one at
a time, never stacking a larger disc on top of a smaller one. With practice, a
bright (and patient) seven-year-old can do it.

What Apple found was that leading generative models could barely do seven
discs, getting less than 80% accuracy, and pretty much couldn’t get scenarios
with eight discs correct at all. It is truly embarrassing that LLMs cannot
reliably solve Hanoi.
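
For a sense of how small the conventional solution is, here is a minimal Python
sketch of the textbook recursive algorithm, plus a checker that replays the
moves against the rules. It is an illustration only, not the Apple paper's code
or prompt; the function names and the peg labels are assumptions made for this
example.

```python
def hanoi(n, source="left", spare="middle", target="right"):
    """Yield (disc, from_peg, to_peg) moves that transfer n discs to target."""
    if n == 0:
        return
    # Park the n-1 smaller discs on the spare peg.
    yield from hanoi(n - 1, source, target, spare)
    # Move the largest remaining disc straight to the target.
    yield (n, source, target)
    # Bring the n-1 smaller discs from the spare peg onto the target.
    yield from hanoi(n - 1, spare, source, target)


def is_valid_solution(n, moves):
    """Replay the moves and check every rule (hypothetical helper)."""
    # Each peg is a stack; the last item is the top disc. Disc 1 is smallest.
    pegs = {"left": list(range(n, 0, -1)), "middle": [], "right": []}
    for disc, src, dst in moves:
        # Only the top disc of a peg may be moved.
        if not pegs[src] or pegs[src][-1] != disc:
            return False
        # A larger disc may never sit on a smaller one.
        if pegs[dst] and pegs[dst][-1] < disc:
            return False
        pegs[dst].append(pegs[src].pop())
    solved = list(range(n, 0, -1))
    return pegs["right"] == solved and not pegs["left"] and not pegs["middle"]


if __name__ == "__main__":
    for n in (3, 7, 8):
        moves = list(hanoi(n))
        print(n, len(moves), is_valid_solution(n, moves))
```

The shortest solution for n discs takes 2^n - 1 moves, so seven discs need 127
moves and eight need 255. That is tedious for a person, but trivial for a
conventional program like this one.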

And, as the paper’s co-lead-author Iman Mirzadeh told me via DM, “it’s not just
about ‘solving’ the puzzle. We have an experiment where we give the solution
algorithm to the model, and [the model still failed] … based on what we observe
from their thoughts, their process is not logical and intelligent”.

The new paper also echoes and amplifies several arguments that Arizona State
University computer scientist Subbarao Kambhampati has been making about the
newly popular LRMs. He has observed that people tend to anthropomorphise these
systems, to assume they use something resembling “steps a human might take when
solving a challenging problem”. And he has previously shown that in fact they
have the same kind of problem that Apple documents.

If you can’t use a billion-dollar AI system to solve a problem that Herb Simon
(one of the actual godfathers of AI) solved with classical (but out of fashion)
AI techniques in 1957, the chances that models such as Claude or o3 are going
to reach artificial general intelligence (AGI) seem truly remote.

So what’s the loose end that I warned you about? Well, humans aren’t perfect
either. On a puzzle like Hanoi, ordinary humans actually have a bunch of
(well-known) limits that somewhat parallel what the Apple team discovered. Many
(not all) humans screw up on versions of the Tower of Hanoi with eight discs.

But look, that’s why we invented computers, and for that matter calculators: to
reliably compute solutions to large, tedious problems. AGI shouldn’t be about
perfectly replicating a human; it should be about combining the best of both
worlds: human adaptiveness with computational brute force and reliability. We
don’t want an AGI that fails to “carry the one” in basic arithmetic just
because sometimes humans do.

Whenever people ask me why I actually like AI (contrary to the widespread myth
that I am against it), and think that future forms of AI (though not
necessarily generative AI systems such as LLMs) may ultimately be of great
benefit to humanity, I point to the advances in science and technology we might
make if we could combine the causal reasoning abilities of our best scientists
with the sheer compute power of modern digital computers.

What the Apple paper shows, most fundamentally, regardless of how you define
AGI, is that these LLMs that have generated so much hype are no substitute for
good, well-specified conventional algorithms. (They also can’t play chess as
well as conventional algorithms, can’t fold proteins like special-purpose
neurosymbolic hybrids, can’t run databases as well as conventional databases,
etc.)

What this means for business is that you can’t simply drop o3 or Claude into
some complex problem and expect them to work reliably. What it means for
society is that we can never fully trust generative AI; its outputs are just
too hit-or-miss.

One of the most striking findings in the new paper was that an LLM may well
work on an easy test set (such as Hanoi with four discs) and seduce you into
thinking it has built a proper, generalisable solution when it has not.

To be sure, LLMs will continue to have their uses, especially for coding and
brainstorming and writing, with humans in the loop.

But anybody who thinks LLMs are a direct route to the sort of AGI that could
fundamentally transform society for the good is kidding themselves.

<https://www.theguardian.com/commentisfree/2025/jun/10/billion-dollar-ai-puzzle-break-down>
