An amusing anecdote: Double descent, aka Grokking, struck me as so obvious that I thought I must be mistaken about it, since everyone seemed so mystified.
What seemed obvious was the approximation of Algorithmic Information that must take place when one reduces the magnitude of the parameters under regularization. The magnitude of the regularization term in loss functions is usually far below that of the error term -- so naturally you get rid of the error first, and it looks like the loss has leveled off when, in fact, training has merely entered a much shallower descent driven by the regularization term. This seemed obvious to me for the same reason I intuited that lossless compression approximates learning, and hence found Matt Mahoney and Marcus Hutter: waaay back at IJCNN #2 I was using a DataCube image-processing system board with Xilinx convolution hardware (i.e., a GPU before GPUs were cool) to do neural image segmentation with quantized weights -- and ran across the advantages of pruning weights for generalization. This was just hands-on practical knowledge, sort of like a German machinist being told to take a block of metal home and rub it and rub it and rub it to get a feel for it. So it's obvious -- right? I mean, what's going on during "Grokking" is just a reduction in the complexity of the parameters -- which should be measurable even without looking at the validation scores. Or is it? Well, I don't know. Has anyone bothered to measure this?

Anyway, another thing that's always seemed obvious to me is that you don't need all that much data to "triangulate" on reality -- all you need is a good diversity of viewpoints, and you can start to squeeze out of the morass of perspectives what is "bias" in measurement/model/motivated reasoning/error and what is a more or less accurate global view, aka "reality". Hence Wikipedia should be a sweet spot between having enough data and managing the resources of the modeling process (whatever the mix of human and machine resources). Of course, it always helps to have "clean" data, but that's what Charlie Smith told me took *over 90%* of the resources when he co-founded the Energy Information Administration -- the other 10% going to modeling the dynamics of the energy economy. (Which, BTW, was what motivated him to finance the second neural net summer when he got control of the Systems Development Foundation, starting with Werbos's work on RNNs and with Hinton.) Forensic epistemology is an inescapably enormous burden that sits on a continuum with "lossless compression" of whatever dataset you settle on to find the dynamical systems comprising your world model. It's all lossless compression -- it's just that in the case of data cleaning, you start with raw measurements straight out of your measurement instruments (including institutions that report data), and you need to record your forensic epistemology in an algorithmic form if you want to be accountable.

So, OK, here I am waiting around for "data efficiency" to make some headway despite the damage done by the likes of Yann LeCun claiming you need more and more organic data, others claiming you need more and more synthetic data, blah blah blah. Then, a few weeks ago, I ran across the paper "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets" <https://arxiv.org/pdf/2201.02177>, which seemed to me to be the right approach *given* "clean" datasets. Once this approach matures, we might have some hope of formalizing what Charlie was saying about "data cleaning" and getting some accountability into the machine learning world about "bias".
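To make the measurement I mean concrete, here's a minimal PyTorch sketch: train on a toy modular-addition task with weight decay and log the overall parameter norm (a crude proxy for parameter complexity) alongside the training loss and validation accuracy. The architecture, split, and hyperparameters below (ToyNet, P=97, weight_decay=1.0, etc.) are my own illustrative guesses, not the settings from the paper, so treat it as a sketch rather than a replication.

    # Sketch: does the weight norm keep falling after the training loss flattens?
    import torch
    import torch.nn as nn

    P = 97                          # modulus for the toy task a + b mod P
    torch.manual_seed(0)

    # Build all (a, b) pairs and a 50/50 train/validation split.
    pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
    labels = (pairs[:, 0] + pairs[:, 1]) % P
    perm = torch.randperm(len(pairs))
    train_idx, val_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

    class ToyNet(nn.Module):
        def __init__(self, dim=128):
            super().__init__()
            self.embed = nn.Embedding(P, dim)
            self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, P))

        def forward(self, ab):
            e = self.embed(ab)                       # (batch, 2, dim)
            return self.mlp(e.flatten(start_dim=1))  # (batch, P) logits

    model = ToyNet()
    # Weight decay supplies the small regularization term whose slow descent
    # should keep running long after the error term has been driven down.
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
    loss_fn = nn.CrossEntropyLoss()

    def accuracy(idx):
        with torch.no_grad():
            return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

    for step in range(20000):
        batch = train_idx[torch.randint(len(train_idx), (512,))]
        loss = loss_fn(model(pairs[batch]), labels[batch])
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 1000 == 0:
            wnorm = sum(p.square().sum() for p in model.parameters()).sqrt().item()
            print(f"step {step:6d}  train_loss {loss.item():.4f}  "
                  f"val_acc {accuracy(val_idx):.3f}  weight_norm {wnorm:.1f}")

If the intuition above is right, the weight_norm column should keep shrinking through the long plateau where train_loss looks done, with validation accuracy jumping somewhere along that slow slide.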
So I told my niece, who recently graduated with an emphasis in AI, to look at that paper and some videos about it as an example of what I'd been trying to explain about the importance of lossless compression, "bias" and data efficiency. She got the importance but, of course, since she's running around trying to find work in AI along with a lot of other recent graduates, everything is all about "the latest thing". Come to find out, that paper was ANCIENT... I mean, 2 years old! That's practically from IJCNN #2! However, "the latest thing" is now "grokfast" <https://arxiv.org/abs/2405.20233>, which, as luck would have it, is based on code inherited from that ANCIENT paper. See the Acknowledgements at the end of the grokfast README on GitHub <https://github.com/ironjr/grokfast>.
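For what it's worth, my reading of the grokfast paper is that it low-pass filters the gradients: keep an exponential moving average of each parameter's gradient (the "slow" component) and add an amplified copy of it back before the optimizer step. The sketch below is my own illustration of that idea in PyTorch -- the names (ema_filter, alpha, lamb) are mine, not necessarily the repo's API.

    # Sketch of EMA gradient filtering in the spirit of grokfast.
    import torch

    def ema_filter(model, ema, alpha=0.98, lamb=2.0):
        """Amplify the slow-moving component of the gradients in place."""
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            ema[name] = alpha * ema.get(name, torch.zeros_like(p.grad)) + (1 - alpha) * p.grad
            p.grad = p.grad + lamb * ema[name]
        return ema

    # Usage inside an ordinary training loop (start with ema = {}):
    #   loss.backward()
    #   ema = ema_filter(model, ema)
    #   optimizer.step()

The appeal, as I understand it, is that amplifying the slow gradient component pushes the model through that long plateau faster, which fits the "small gradient descent on the regularization term" picture above.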
