An amusing anecdote:

Double descent, aka grokking, struck me as so obvious that I figured I
must be mistaken about it, since everyone else seemed so mystified.

What seemed obvious was the approximation of Algorithmic Information that
must take place when regularization reduces the magnitude of the
parameters.  The regularization term in a loss function is usually far
smaller in magnitude than the error term, so gradient descent naturally
gets rid of the error first; the loss looks like it has leveled off when,
in fact, training has merely entered a much shallower descent driven by
the regularization gradient.  This seemed obvious to me for the same
reason I intuited that lossless compression approximates learning, which
is how I found Matt Mahoney and Marcus Hutter:
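To make that concrete, here is a minimal toy sketch (a hypothetical setup
of my own, with illustrative numbers, not from any paper): the printed
error term flattens long before the much smaller weight-decay term stops
shrinking.

    import torch

    # Toy two-scale loss: error + lam * ||w||^2 with lam << 1, so early
    # training is dominated by the error gradient; once the error is
    # near zero, descent continues on the much smaller weight-norm term.
    torch.manual_seed(0)
    X, y = torch.randn(64, 10), torch.randn(64, 1)
    model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    lam = 1e-4  # regularization coefficient, far below the error scale

    for step in range(10001):
        opt.zero_grad()
        err = torch.nn.functional.mse_loss(model(X), y)
        reg = sum((p ** 2).sum() for p in model.parameters())
        (err + lam * reg).backward()
        opt.step()
        if step % 2000 == 0:
            # err levels off first; lam*reg keeps falling long afterward
            print(f"step {step:5d}  error {err.item():.6f}  "
                  f"lam*reg {(lam * reg).item():.6f}")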

Waaay back at IJCNN #2 I was using a DataCube image-processing board
built around Xilinx convolution hardware (i.e., a GPU before GPUs were
cool) to do neural image segmentation with quantized weights, and I ran
across the advantages of pruning weights for generalization.  This was
just hands-on practical knowledge, sort of like a German machinist being
told to take a block of metal home with him and rub it and rub it and rub
it to get a feel for it.

So it's obvious -- right?  I mean, what's going on during "Grokking" is
just a reduction in the complexity of the parameters -- which should be
measurable even without looking at the validation scores.

Or is it?

Well, I don't know.  Has anyone bothered to measure this?
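Here's one crude way somebody could measure it (my own sketch, not
anything from the grokking literature): quantize the weights and take
their losslessly compressed size as a rough proxy for the algorithmic
information in the parameters.  No validation set in the loop.

    import zlib
    import numpy as np

    def param_complexity_bytes(model):
        """Compressed size of 8-bit-quantized weights: a crude proxy for
        the algorithmic information in a PyTorch model's parameters."""
        w = np.concatenate([p.detach().cpu().numpy().ravel()
                            for p in model.parameters()])
        scale = float(np.abs(w).max()) or 1.0
        q = np.round(w / scale * 127).astype(np.int8)  # quantize to 8 bits
        return len(zlib.compress(q.tobytes(), 9))

Logged every few hundred steps during training, the prediction is that
this number keeps falling through the apparent plateau and drops around
the moment validation accuracy jumps.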

Anyway, another thing that has always seemed obvious to me is that you
don't need all that much data to "triangulate" on reality.  All you need
is a good diversity of viewpoints, and you can start to squeeze out of
the morass of perspectives what is "bias" (in measurement, model,
motivated reasoning, or plain error) and what is a more or less accurate
global view, aka "reality".  Hence Wikipedia should be a sweet spot
between having enough data and keeping the modeling process manageable
(whatever the mix of human and machine resources).
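A toy sketch of that triangulation intuition (entirely my own toy model,
with made-up numbers): several sources report the same facts, each with
its own systematic slant plus noise; with enough diverse sources, the
shared signal separates from the per-source bias.

    import numpy as np

    rng = np.random.default_rng(0)
    n_sources, n_facts = 8, 200
    truth = rng.normal(size=n_facts)               # the "reality" to recover
    slant = rng.normal(scale=2.0, size=n_sources)  # each source's systematic bias
    noise = rng.normal(scale=0.3, size=(n_sources, n_facts))
    reports = truth + slant[:, None] + noise

    # Additive model: report[i, j] = truth[j] + slant[i] + noise.
    # Row means expose each source's slant; subtracting it and averaging
    # across sources recovers the truth, up to one unidentifiable global
    # offset (the average slant).
    slant_hat = reports.mean(axis=1) - reports.mean()
    truth_hat = (reports - slant_hat[:, None]).mean(axis=0)

    err = (truth_hat - truth_hat.mean()) - (truth - truth.mean())
    print("RMS error up to offset:", np.sqrt((err ** 2).mean()))

No single source is unbiased, yet the pooled estimate beats every
individual source, and adding more diverse sources shrinks the residual
further.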

Of course, it always helps to have "clean" data, but then that's what
Charlie Smith told me took *over 90%* of the resources when he co-founded
the Energy Information Administration, with the other 10% going to
modeling the dynamics of the energy economy.  (That, BTW, was what
motivated him to finance the second neural net summer when he got control
of the Systems Development Foundation, starting with Werbos's work on
RNNs and with Hinton.)
Forensic epistemology is an inescapably enormous burden, and it sits on a
continuum with "lossless compression" of whatever dataset you go with to
find the dynamical systems comprising your world model.  It's all lossless
compression; it's just that in the case of data cleaning you start with
raw measurements straight out of your measurement instruments (including
the institutions that report data), and you need to record your forensic
epistemology in algorithmic form if you want to be accountable.
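Here is a minimal sketch of what "algorithmic form" might look like (the
records and cleaning rules are hypothetical): every cleaning decision is
emitted as a replayable log entry, so the path from raw measurements to
the modeling dataset is itself inspectable data.

    import json

    def clean(raw_records):
        """Clean records while logging every decision for accountability."""
        out, audit = [], []
        for i, rec in enumerate(raw_records):
            if rec.get("value") is None:
                audit.append({"row": i, "action": "drop",
                              "reason": "missing value"})
                continue
            if rec.get("unit") == "MWh":  # normalize units, and say so
                rec = {**rec, "value": rec["value"] * 1000.0, "unit": "kWh"}
                audit.append({"row": i, "action": "convert",
                              "reason": "MWh -> kWh"})
            out.append(rec)
        return out, audit

    raw = [{"value": 1.2, "unit": "MWh"}, {"value": None, "unit": "kWh"}]
    cleaned, audit = clean(raw)
    print(json.dumps(audit, indent=2))  # the accountable part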

So, OK, here I am waiting around for "data efficiency" to make some
headway despite the damage done by the likes of Yann LeCun claiming you
need more and more organic data, and others claiming you need more and
more synthetic data, blah blah blah.  Then, a few weeks ago, I ran across
the paper "Grokking: Generalization Beyond Overfitting on Small
Algorithmic Datasets <https://arxiv.org/pdf/2201.02177>", which seemed to
me to be the right approach *given* "clean" datasets.  Once this approach
matures, we might have some hope of formalizing what Charlie was saying
about "data cleaning" and getting some accountability into the machine
learning world about "bias".

So I told my niece, who recently graduated with an emphasis in AI, to
look at that paper and some videos about it as an example of what I'd
been trying to explain about the importance of lossless compression,
"bias", and data efficiency.  She got the importance but, of course,
since she's running around trying to find work in AI along with a lot of
other recent graduates, everything is all about "the latest thing".  Come
to find out, that paper was ANCIENT... I mean, 2 years old!  That's
practically from IJCNN #2!

However, "the latest thing" is now "grokfast
<https://arxiv.org/abs/2405.20233>" which, as luck would have it, is based
on code inherited from that ANCIENT paper.  See the Acknowledgements at the
end of the grokfast README at github <https://github.com/ironjr/grokfast>.
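For anyone who doesn't want to chase the links, here's my own
condensation of grokfast's core trick as I read the paper: low-pass
filter each parameter's gradient with an exponential moving average and
amplify that slow component before the optimizer step.  The alpha and
lamb values below are illustrative, not the paper's tuned ones, and the
repo's own implementation handles more details.

    import torch

    def gradfilter_ema(model, grads=None, alpha=0.98, lamb=2.0):
        """One filtering pass; call between loss.backward() and
        optimizer.step().  Pass grads=None on the first call."""
        if grads is None:  # lazily initialize the EMA state
            grads = {n: torch.zeros_like(p.grad)
                     for n, p in model.named_parameters()
                     if p.grad is not None}
        for n, p in model.named_parameters():
            if p.grad is not None:
                # update the slow (EMA) component, then amplify it
                grads[n].mul_(alpha).add_(p.grad, alpha=1 - alpha)
                p.grad.add_(grads[n], alpha=lamb)
        return grads

    # In the training loop (ema_state = None before the loop):
    #     loss.backward()
    #     ema_state = gradfilter_ema(model, ema_state)
    #     optimizer.step()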
