On Wednesday, December 25, 2024 at 1:46:51 PM UTC+8 Brent Meeker wrote:


I think that LLM thinking is not like reasoning.  It reminds me of my 
grandfather who was a cattleman in Texas.  I used to go to auction with 
him where he would buy calves to raise and where he would auction off 
ones he had raised.  He could do calculations of what to pay, gain and 
loss, expected prices, cost of feed all in his head almost instantly.  
But he couldn't explain how he did it.  He could do it pencil and paper 
and explain that; but not how he did it in his head.  So although the 
arithmetic would be of the same kind he couldn't figure insurance rates 
and payouts, or medical expenses, or home construction costs in his 
head.  The difference with LLM's is they have absorbed so many examples 
on every subject, as my grandfather had of auctioning cattle, that the 
LLM's don't have reasoning, they have finely developed intuition, and 
they have it about every subject.  Humans don't have the capacity to 
develop that level of intuition about more than one or two subjects; 
beyond that they have to rely on slow, formal reasoning.


Nail on the head. That’s an interesting story, and it got me thinking (too 
much again). It reminds me of the distinction in Kahneman’s 2011 bestseller 
between “type 1” and “type 2” thinking (aka “system 1” and “system 2”). And 
although there are problems and controversies with his ideas and evidence, 
as is almost standard for psychological proposals of this kind, we can, if 
you indulge me, take his ideas less literally, as describing loose 
cognitive styles of reasoning, as follows: 

LLMs today (end of 2024), with their vast memory-like parameter sets, are 
closer to system 1 “intuitive” reasoning styles that appear startlingly 
broad and context-sensitive, even “fine”, as you say, and somewhat akin to 
your grandfather’s style of reasoning within his area(s) of expertise. Yet, 
in my view, LLMs (not wishing to imply anything about your grandfather) are 
not performing the slower, more deliberative, rule-based thinking we often 
associate with system 2.

Take the example of Magnus Carlsen, arguably the best-performing chess 
player in recent history, who relies increasingly on instinctive “type 1” 
moves after thousands of hours of deliberate “type 2” calculation (when he 
was young and ascending the throne, he was better known for his 
calculation skills). This illustrates how extensive practice can shift 
certain skills from a laborious, step-by-step process toward fast, 
experience-based pattern recognition. I’ve observed many times, especially 
in recent years, how in a time crunch he glances at the clock and makes 
decisive, critical, almost computer-like moves, based purely on a gut 
feeling more precise and finely tuned than his opponent’s, with too little 
time left to calculate his way through the position. In streams where he 
comments on other GMs’ play, he often instantly blurts out statements like 
“that knight just belongs on e6” or “the position is screaming for bishop 
f8”, without a second of thought or a glance at the AI/engine evaluation.

A similar thing happens, perhaps not at a world class level (but even 
Magnus makes mistakes, just less often than others), when a person learns 
to drive: at first, every procedure (shifting gears, 360-degree checks, 
mirror vigilance, steering) is deliberate and conscious, whereas a 
practiced driver performs multiple operations fluently and simultaneously, 
without thinking in a type 2 manner. We see this in most advanced tasks, 
like playing chess, doing math, or solving puzzles: humans merge system 1 
and system 2, relying more on one or the other depending on experience and 
context.

I see current LLMs as occupying a distinct spot: the “intuition” they rely 
on is not gained through slow, personal experience in one domain, but 
rather from the vast breadth of their pretraining across nearly every 
field for which text data or code exists. Again, it’s not just static 
memory, and I do recognize the nuance that they are *not just memorizing 
answers to questions or mere content.* AGI advocates scream: “LLMs 
interpolate, so it’s reasoning!” Unfortunately, they rarely specify what 
that means. 

What they primarily memorize, if I understand correctly, is 
functions/programs, in a certain sense. Programs do generalize to some 
extent, by mathematical definition. When John questions an LLM via his 
keyboard, he is *essentially querying a point in program space*, where we 
can think of the LLM in some domain as a manifold with each point encoding 
a function/program. Yes, they do interpolate across these manifolds to 
combine/compose programs, which implies an effectively unbounded number of 
possible programs that John can reach through his keyboard. That is why 
they *appear* to reason richly and organically, unlike in earlier years, 
and can help debug code (as they have been trained against compositions 
that yield false results) in sophisticated ways, *with human assistance.*

So what we’re doing with LLMs is training them as rich, flexible models to 
predict the next token. If we had infinite memory capacity, we could 
simply train them to learn a sort of lookup table. But the reality is more 
modest: LLMs have only some billions or trillions of parameters. That’s 
why they screw up basic grade-school math problems not in their training 
set, or fail to remove a single small element John doesn’t want from the 
complex image he’s generated with his favorite image generator. *An LLM 
therefore cannot learn a lookup table for every possible input sequence 
related to its training data. It is forced to compress. So what these 
programs “learn” are predictive functions that take the form of vector 
functions, because the LLM is a curve… and the only thing we can encode 
with a curve is a set of vector functions. These take elements of the 
input sequence as inputs and output elements of what follows. *
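A minimal sketch of the lookup-table point (again my own toy, not an 
actual LLM mechanism): an explicit next-token table needs one entry per 
distinct context seen in training, while a fixed “parameter budget”, here 
crudely simulated by keeping only one successor per context, forces a 
lossy, compressed summary of the data.

```python
from collections import defaultdict

corpus = "the cat sat on the mat the cat sat".split()

# Approach 1: a literal lookup table from each context word to everything
# that followed it. Its size grows with the number of distinct contexts.
table = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    table[prev].append(nxt)
print(len(table))           # one entry per distinct context word: 5

# Approach 2: a crude stand-in for a fixed parameter budget, keeping only
# the most frequent successor per context and discarding the rest.
compressed = {prev: max(nxts, key=nxts.count) for prev, nxts in table.items()}
print(compressed["the"])    # the lossy summary predicts "cat" after "the"
```

A real model compresses into continuous parameters rather than dropping 
entries, but the pressure is the same: finite capacity rules out rote 
storage of every input sequence.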

Say John feeds it the works of Oscar Wilde for the first time, and the LLM 
has already “learned” a model of the English language. The text John has 
input is slightly different, and yet still the English language. That’s 
why it’s so good at emulating linguistic styles and elegance (or lack 
thereof): it’s possible to model Oscar Wilde by reusing many of the 
functions learned in modeling English in general. *It therefore becomes 
trivial to model Wilde’s style by simply deriving a style-transfer 
function that goes from the model of English to the Wilde texts.* That’s 
how/why people are amazed at its linguistic dexterity in this sense.
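As a throwaway illustration of that “style-transfer function” (entirely my 
own toy; the sample texts and the word-frequency “model” are hypothetical 
stand-ins for a real language model): the transfer function is just the 
difference between a base model and a target-style model, so almost all of 
the base model is reused for free.

```python
from collections import Counter

def freq_model(text):
    # A normalised word-frequency "model" of a text (toy stand-in).
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

# Hypothetical stand-ins for "English in general" and a Wilde sample.
base = freq_model("the quick brown fox jumps over the lazy dog the end")
target = freq_model("the truth is rarely pure and never simple")

# The "style-transfer function": the delta that carries the base model
# onto the target style. Everything the two models share is reused.
delta = {w: target.get(w, 0.0) - base.get(w, 0.0)
         for w in set(base) | set(target)}

# Applying the delta to the base recovers the target model exactly.
restyled = {w: base.get(w, 0.0) + delta[w] for w in delta}
assert all(abs(restyled[w] - target.get(w, 0.0)) < 1e-12 for w in restyled)
```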

That’s why they appear to have an “intuitive” command of everything—law, 
biology, programming, philosophy—despite lacking the capacity to handle 
tasks that need deeper, stepwise reasoning or dynamic re-planning in new 
contexts. This resonates with your grandfather’s “cattle auction” case: he 
could handle his specialized domain through near-instant intuition but 
needed pencil-and-paper for general arithmetic outside of it, if I 
understand correctly. LLMs, similarly, may seem “fluent” in many subjects, 
yet they cannot truly engage in what we would label “system 2 program 
synthesis”, unless of course the training distribution already covers a 
close analogue.

In short, the reason LLM outputs feel like “type 1 reasoning on steroids” 
is that these models have memorized so many examples that their combined 
intuition extends across nearly all known textual domains. But when a 
problem truly demands formal reasoning steps absent from their training 
data, LLMs lack a real “type 2” counterpart—no robust self-critique, no 
internal program writing, and no persistent memory to refine their logic. 
We can therefore liken them to formidable intuition machines without the 
same embedded capacity for system 2, top-down reasoning or architectural 
self-modification that we see in real human skill acquisition. My big 
mistake is of course that John would never read Oscar Wilde. 

As a short note to Russell: yes, they appear to be competent at such 
assistance for similar reasons. I don't see advancements in the coding use 
case as a function purely of scaling and compute. The more advanced models 
can assist in such ways because they have longer histories of having 
suppressed "wrong" function chains/interpolation. Thx for the inspo guys.

-- 
You received this message because you are subscribed to the Google Groups 
"Everything List" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion visit 
https://groups.google.com/d/msgid/everything-list/ad612720-63e4-4837-8a48-13540c32b337n%40googlegroups.com.
