The numbers I used come from my paper on the cost of AI, http://mattmahoney.net/costofai.pdf (published in https://www.cambridgescholars.com/product/978-1-5275-0000-6 ). Tom Landauer (1986) studied human long-term memory capacity and estimated about 10^9 bits, based on recall tests for words, numbers, pictures, and music clips. The picture tests had people view 10,000 photos for 5 seconds each over 2 days and tested recognition a week later with a mix of old and new photos. The results suggest we remember about 30 bits from each image, or 6 bits per second, about the same rate at which we remember text.

Shannon (1950) used human text prediction tests to estimate that written English has an entropy of 0.6 to 1.3 bits per character, which agrees with the best compressors on my large text benchmark, around 0.9 bpc. Compression ratio improves with the size of the model and the training set, as can be seen in the "LTCB size vs memory" graph at http://mattmahoney.net/dc/text.html
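A quick back-of-the-envelope check of these figures, as a Python sketch. The inputs (30 bits per picture, 5 seconds of viewing, the 1 GB enwik9 file, ~0.9 bpc) are just the round numbers above, so treat the outputs as approximations rather than measurements.

    # Rough check of the figures above. All inputs are the round numbers
    # quoted in this post; the results are approximations.
    bits_per_picture = 30        # estimated bits retained per photo
    seconds_per_picture = 5      # viewing time per photo in the test
    print(bits_per_picture / seconds_per_picture, "bits per second retained")  # 6.0

    enwik9_bytes = 10**9         # size of the LTCB test file (1 GB of Wikipedia text)
    bpc = 0.9                    # approximate ratio of the best compressors
    print(enwik9_bytes * bpc / 8 / 1e6, "MB compressed at", bpc, "bpc")        # 112.5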
Turing (1950) estimated that it would take 10^9 bits of memory, and components no faster than the technology of his day (working in parallel, I assume), to win the imitation game. He did not explain his reasoning, but I do here: http://mattmahoney.net/dc/rationale.html You learn language by example, and over a lifetime you get about 1 GB of examples. Any knowledge you need to pass the Turing test can be learned from text alone. I described all of this more than a decade before we had any LLMs passing the Turing test.

I estimate there are 10^17 bits of human knowledge: 10^9 bits per person of long-term memory x 10^10 people x 0.01, the fraction of what you know that is known only to you and not written down. At 0.01 bits per character, that is equivalent to about 10^19 bytes of text, a million times bigger than the training set of any language model to date.

I don't claim that LLMs use the optimal ratio of training set size to number of parameters. Information theory suggests they should be equal, about one parameter per bit of compressed training data, but computation is expensive. For example, Falcon 180B (https://thesequence.substack.com/p/falcon-180b-takes-open-source-llms ) has 180B parameters trained on 3.5T tokens. That would be optimal if the compression ratio were 0.01 bpc, assuming 5 characters per token. Or maybe the training data is more than they need and they are optimizing cost. It took 7M GPU hours to train on 4096 GPUs. An A100 GPU with 6,912 CUDA cores and 80 GB of memory at 2 TB/s bandwidth costs about $15,000, or can be rented on Azure for about $3 per GPU hour. So the training alone costs about $20M to rent or $60M to buy.

The fact that neural networks work so well on so many different AI problems suggests we do have a good understanding of how our neurons compute. Of course there is no biological model of back propagation, but there doesn't have to be. The general algorithm is to adjust anything that can be adjusted in the direction that reduces error. You could even train a randomly connected network with fixed weights just by adjusting thresholds. And natural language has a structure that allows it to be learned one layer at a time, so back propagation isn't needed in the first place.

We probably learn by updating weights in the hippocampus, but it can only hold a few days of memories. So at night we randomly sample those memories and make long-term copies in the cerebral cortex by growing axons and physically connecting neurons. That's just one theory of why we need to sleep and dream. It doesn't mean we need to design an LLM the same way.
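To make the estimates above (the 10^17 bits of knowledge and the Falcon 180B figures) easy to check, here is a short Python sketch using the same round numbers: 10^9 bits per person, 10^10 people, 1% not written down, 5 characters per token, $3 per GPU hour, $15,000 per A100. These are the post's estimates, not measured values.

    # Human knowledge estimate.
    knowledge_bits = 1e9 * 1e10 * 0.01        # bits/person * people * fraction unique
    print("human knowledge ~ %.0e bits" % knowledge_bits)                     # 1e+17
    print("as text ~ %.0e characters at 0.01 bpc" % (knowledge_bits / 0.01))  # 1e+19

    # Falcon 180B: implied compression ratio if parameters ~ compressed bits.
    params = 180e9                            # parameters, ~1 bit each by the rule of thumb
    chars = 3.5e12 * 5                        # 3.5T tokens * ~5 characters per token
    print("implied ratio ~ %.3f bits per character" % (params / chars))       # ~0.010

    # Training cost: 7M GPU hours rented at $3/hour, or 4096 A100s at $15,000 each.
    print("rent ~ $%.0fM, buy ~ $%.0fM" % (7e6 * 3 / 1e6, 4096 * 15000 / 1e6))  # $21M, $61M

And a toy illustration of the "adjust anything that can be adjusted" point: a small network with fixed random weights where only the thresholds (biases) are moved downhill on the error by gradient descent. This is just a sketch of the idea on an arbitrary toy problem, not a claim about how the brain learns or about how well bias-only training scales.

    # Train only the thresholds of a randomly weighted 2-layer network (numpy).
    import numpy as np
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (256, 2))          # toy inputs
    t = np.sin(3 * X[:, 0]) * X[:, 1]         # arbitrary target function

    W1 = rng.normal(0, 2, (2, 64))            # fixed random hidden weights
    W2 = rng.normal(0, 1, (64, 1)) / 8        # fixed random output weights
    b1, b2 = np.zeros(64), np.zeros(1)        # adjustable thresholds
    lr = 0.05
    for step in range(2001):
        h = np.tanh(X @ W1 + b1)              # hidden layer
        y = (h @ W2 + b2)[:, 0]               # output
        err = y - t
        if step % 500 == 0:
            print(step, "mse", round(float(np.mean(err ** 2)), 4))
        dy = 2 * err[:, None] / len(X)        # d(mse)/dy
        b2 -= lr * dy.sum(axis=0)             # move output threshold downhill
        b1 -= lr * ((dy @ W2.T) * (1 - h ** 2)).sum(axis=0)  # and hidden thresholds

The error goes down because every adjustable parameter is moved in the direction that reduces it; the fixed random weights just provide features for the thresholds to gate.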
On Mon, Sep 11, 2023 at 6:02 AM mm ee <[email protected]> wrote:

> Again, no reason to tie 1 neuron to 1 memory or even pattern of information, and I am strictly against comparing the brain to artificial neural networks, as there has been no evidence that the facsimiles present in ANNs are in any way reflective of what is actually happening in living brains. The most I have seen is that if you bend over backwards and recite Infinite Jest 9 times while squinting really hard, it *kind of* looks like the brain can do a form of backpropagation if you forced it to.
>
> Even given the comparison to ANNs, you cannot model non-linearity in a single perceptron model unless said perceptron uses a polynomial activation function (which neither brains nor standard ANNs use); to imply a single perceptron could practically encode a pattern of a memory seems absurd.
>
> On the point of compression, no idea where you arrived at that figure - as per Chinchilla scaling laws, it has been known for some time the optimal data : parameter ratio is far off what GPT3 used. Llama 2 has a 70B parameter variant trained on [IIRC] 2 trillion tokens and outperforms GPT3. Doesn't make any sense to derive compression ratios based on how OpenAI or Meta decided to train their systems, for all we know even Chinchilla is suboptimal, so no idea how you arrive at 0.1 bpc, and definitely no idea how you arrived at the figure that a human brain stores 10^19... "characters" of information??
>
> I wouldn't call language "figured out". LLMs are impressive, neural networks have incredible use cases, but they are still far from understanding language on any human-like level. Just as is the case with image generation models, they can definitely produce something that looks like art if you squint, until you look a little closer and realize there's one too many hands..
>
> On Sun, Sep 10, 2023, 6:09 PM Matt Mahoney <[email protected]> wrote:
>
>> On Sat, Sep 9, 2023, 5:57 PM mm ee <[email protected]> wrote:
>>
>>> There is no reason to believe that 1 bit or 1 synaptic connection corresponds to a single pattern of a memory
>>
>> Not one synapse, one neuron. Human memory is associative. Synapses represent associations between concepts, at least in the connectionist model that makes neural networks easy to understand. But connectionism doesn't have a mechanism for learning new concepts and adding neurons. We solve the problem by having neurons represent linear combinations of concepts and synapses represent linear combinations of associations.
>>
>> A rule of thumb for programming neural networks is to use on the order of 1 synapse or weight or parameter per bit of compressed training data. Too small and you forget. Too big and you over fit. GPT3 uses 175B parameters to train on 500B tokens of text, suggesting a compression ratio of 0.1 bits per character. A human level language model in theory should need 1B parameters to train on 1 GB of text at 1 bpc. ChatGPT knows far more than any human could remember. And compression ratio gets better as the training set gets bigger. All of human knowledge is 10^19 characters compressing to 10^17 bits at 0.01 bpc because 99% of what you know is shared or written down (why it costs 1% of lifetime income to replace an employee).
>>
>> The mystery is why does the brain need 6 x 10^14 synapses to store 10^9 bits of long term memories? Maybe because neurons are slow so you make multiple copies of bits to move them closer to where they are needed. Like a server farm stores 1M copies of Linux on disk, RAM, cache, and registers. Or your body has 10^13 copies of your DNA and still has to make multiple copies of a gene to mRNA before transcribing it.
>>
>> So if we can optimize LLM storage by using faster components, maybe we can do the same for vision at video speeds. We know that we can only store visual information at 5 to 10 bits per second, same as language. We figured out language by abandoning symbolic reasoning and training semantics before grammar. Maybe we can solve vision by modeling a fovea and eye movements.
--
Matt Mahoney, [email protected]
