ARTIFICIAL INTELLIGENCE

Large language models aren’t people. Let’s stop testing them as if they were.

With hopes and fears about this technology running wild, it's time to agree on 
what it can and can't do.

By Will Douglas Heaven  August 30, 2023

https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were/



When Taylor Webb played around with GPT-3 in early 2022, he was blown away by 
what OpenAI’s large language model appeared to be able to do.

Here was a neural network trained only to predict the next word in a block of 
text—a jumped-up autocomplete.

And yet it gave correct answers to many of the abstract problems that Webb set 
for it—the kind of thing you’d find in an IQ test.

“I was really shocked by its ability to solve these problems,” he says.

“It completely upended everything I would have predicted.”

Webb is a psychologist at the University of California, Los Angeles, who 
studies the different ways people and computers solve abstract problems.

He was used to building neural networks that had specific reasoning 
capabilities bolted on.

But GPT-3 seemed to have learned them for free.

Last month Webb and his colleagues published an article in Nature Human 
Behaviour, in which they describe GPT-3’s ability to pass a variety of tests 
devised to assess the use of analogy to solve problems (known as analogical 
reasoning).

On some of those tests GPT-3 scored better than a group of undergrads.

“Analogy is central to human reasoning,” says Webb.

“We think of it as being one of the major things that any kind of machine 
intelligence would need to demonstrate.”

What Webb’s research highlights is only the latest in a long string of 
remarkable tricks pulled off by large language models.

For example, when OpenAI unveiled GPT-3’s successor, GPT-4, in March, the 
company published an eye-popping list of professional and academic assessments 
that it claimed its new large language model had aced, including a couple of 
dozen high school tests and the bar exam.

OpenAI later worked with Microsoft to show that GPT-4 could pass parts of the 
United States Medical Licensing Examination.


And multiple researchers claim to have shown that large language models can 
pass tests designed to identify certain cognitive abilities in humans, from 
chain-of-thought reasoning (working through a problem step by step) to theory 
of mind (guessing what other people are thinking).
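
To make those terms concrete: a chain-of-thought test rewards a model for 
working through intermediate steps before giving its answer, rather than 
producing the answer outright. The Python sketch below shows roughly what that 
looks like; the word problem, expected answer, and grading rule are 
illustrative assumptions, not drawn from the studies mentioned above.

    # A minimal sketch of a direct-answer prompt versus a chain-of-thought
    # prompt. The word problem, expected answer, and grading rule here are
    # illustrative assumptions, not taken from any of the studies above.

    problem = (
        "A library has 120 books. It lends out 45 and receives 30 new ones. "
        "How many books does it have now?"
    )
    expected_answer = "105"

    # Direct prompt: ask for the answer outright.
    direct_prompt = problem + "\nAnswer:"

    # Chain-of-thought prompt: nudge the model to work through the problem
    # step by step before committing to an answer.
    cot_prompt = (
        problem
        + "\nLet's think step by step, then give the final answer on its own line."
    )

    def grade(model_output: str) -> bool:
        """Check whether the last line of the response contains the answer."""
        last_line = model_output.strip().splitlines()[-1]
        return expected_answer in last_line

    # A step-by-step response a model might plausibly produce:
    sample_output = "120 - 45 = 75 books remain. 75 + 30 = 105.\n105"
    print(grade(sample_output))  # True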

These kinds of results are feeding a hype machine predicting that these 
machines will soon come for white-collar jobs, replacing teachers, doctors, 
journalists, and lawyers.

Geoffrey Hinton has called out GPT-4’s apparent ability to string together 
thoughts as one reason he is now scared of the technology he helped create.

But there’s a problem: there is little agreement on what those results really 
mean. Some people are dazzled by what they see as glimmers of human-like 
intelligence; others aren’t convinced one bit.

“There are several critical issues with current evaluation techniques for large 
language models,” says Natalie Shapira, a computer scientist at Bar-Ilan 
University in Ramat Gan, Israel.

“It creates the illusion that they have greater capabilities than what truly 
exists.”

That’s why a growing number of researchers—computer scientists, cognitive 
scientists, neuroscientists, linguists—want to overhaul the way these models 
are assessed, calling for more rigorous and exhaustive evaluation.

Some think that the practice of scoring machines on human tests is wrongheaded, 
period, and should be ditched.

“People have been giving human intelligence tests—IQ tests and so on—to 
machines since the very beginning of AI,” says Melanie Mitchell, an 
artificial-intelligence researcher at the Santa Fe Institute in New Mexico.

“The issue throughout has been what it means when you test a machine like this. 
It doesn’t mean the same thing that it means for a human.”

“There’s a lot of anthropomorphizing going on,” she says.

“And that’s kind of coloring the way that we think about these systems and how 
we test them.”

With hopes and fears for this technology at an all-time high, it is crucial 
that we get a solid grip on what large language models can and cannot do.

Open to interpretation

Most of the problems with how large language models are tested boil down to the 
question of how the results are interpreted.

Assessments designed for humans, like high school exams and IQ tests, take a 
lot for granted. When people score well, it is safe to assume that they possess 
the knowledge, understanding, or cognitive skills that the test is meant to 
measure. (In practice, that assumption only goes so far.

Academic exams do not always reflect students’ true abilities. IQ tests measure 
a specific set of skills, not overall intelligence. Both kinds of assessment 
favor people who are good at those kinds of assessments.)


But when a large language model scores well on such tests, it is not clear at 
all what has been measured.

Is it evidence of actual understanding? A mindless statistical trick? Rote 
repetition?

“There is a long history of developing methods to test the human mind,” says 
Laura Weidinger, a senior research scientist at Google DeepMind.

“With large language models producing text that seems so human-like, it is 
tempting to assume that human psychology tests will be useful for evaluating 
them. But that’s not true: human psychology tests rely on many assumptions that 
may not hold for large language models.”

Webb is aware of the issues he waded into. “I share the sense that these are 
difficult questions,” he says.

He notes that despite scoring better than undergrads on certain tests, GPT-3 
produced absurd results on others.

For example, it failed a version of an analogical reasoning test about physical 
objects that developmental psychologists sometimes give to kids.

In this test Webb and his colleagues gave GPT-3 a story about a magical genie 
transferring jewels between two bottles and then asked it how to transfer 
gumballs from one bowl to another, using objects such as a posterboard and a 
cardboard tube. The idea is that the story hints at ways to solve the problem.
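
As a rough illustration of how such a test can be posed to a model in 
practice, the Python sketch below sends a loosely paraphrased version of the 
genie story and the gumball problem through the OpenAI chat API. The wording, 
the model name, and the API setup are assumptions made for illustration; they 
are not the researchers’ actual prompts or protocol.

    # A rough sketch of posing the analogical-transfer test to a model via
    # the OpenAI chat API (openai>=1.0). The story and question below are
    # loose paraphrases of the description above; the wording, model name,
    # and API setup are illustrative assumptions, not the study's protocol.
    from openai import OpenAI

    client = OpenAI()  # expects the OPENAI_API_KEY environment variable

    source_story = (
        "A magical genie needed to move jewels from one bottle into another "
        "bottle that was out of reach. The genie rolled a sheet of paper into "
        "a long tube, placed it between the bottles, and poured the jewels "
        "through the tube into the second bottle."
    )

    target_problem = (
        "You need to move gumballs from one bowl into another bowl that is "
        "out of reach. You have a posterboard and a cardboard tube. Explain, "
        "step by step, how you would move the gumballs."
    )

    prompt = source_story + "\n\n" + target_problem

    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; the study itself evaluated GPT-3
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)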

“GPT-3 mostly proposed elaborate but mechanically nonsensical solutions, with 
many extraneous steps, and no clear mechanism by which the gumballs would be 
transferred between the two bowls,” the researchers write in Nature Human 
Behaviour.

“This is the sort of thing that children can easily solve,” says Webb. “The 
stuff that these systems are really bad at tends to be things that involve 
understanding of the actual world, like basic physics or social 
interactions—things that are second nature for people.”

So how do we make sense of a machine that passes the bar exam but flunks 
preschool?