It seems GPT-2/3 must be using several mechanisms like the ones that follow, because otherwise it would have no chance of predicting well:
P.S. Attention Heads isn't listed below, but it's an important one: it can, for example, predict a last name accurately by looking only at certain words and ignoring all others, e.g. "[Jen Cath] is a girl who [has a mom] named [Tam] **Cath**". Tasks = something to do with where it looks, in which order, manipulation, etc.

---Compressors/MIXERS---

Syntactics:

Intro: Letters, words, and phrases re-occur in text. AI finds such patterns in data and **mixes** them. We don't store the same letter or phrase twice; we just update connection weights to represent frequencies.

Explanation: Suppose our algorithm has only ever seen "Dogs eat. Cats eat. Cats sleep. My Dogs Bark." and is prompted with the input "My Dogs". If we pay Attention to just 'Dogs' and require an exact memory match, the possible predicted futures and their probabilities (frequencies) are 'eat' 50% and 'Bark' 50%. If we consider 'My Dogs', we have fewer memories and predict 'Bark' 100%. The matched neuron's parent nodes receive split energy from the child match.

BackOff: A longer match considers more information but has very little experience behind it, while a shorter match has the most experience but little context. A summed **mix** predicts better: we look in memory at what follows 'Dogs' and what follows 'My Dogs' and blend the two sets of predictions to get, e.g., 'eat' 40% and 'Bark' 60%.

Semantics: If 'cat' and 'dog' share 50% of the same contexts, then maybe the contexts they don't share are really shared as well. So you see cat ate, cat ran, cat ran, cat jumped, cat jumped, cat licked... and dog ate, dog ran, dog ran. The predictions they don't share could probably be shared too, so maybe 'dog jumped' is a good prediction. This helps prediction a lot: it lets you match a given prompt against many different memories that are similarly worded. Like the rest above, you mix these; you need not store every sentence from your experiences. The result is a fast, low-storage brain.
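The BackOff blend above can be sketched in a few lines of Python. The 0.8/0.2 weights are purely illustrative values chosen to reproduce the 'eat' 40% / 'Bark' 60% numbers in the example; they are not anything GPT-2/3 actually uses:

```python
from collections import Counter

corpus = "Dogs eat . Cats eat . Cats sleep . My Dogs Bark .".split()

def next_word_counts(tokens, context):
    """Count which words follow `context` (a tuple of words) in `tokens`."""
    n = len(context)
    counts = Counter()
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == context:
            counts[tokens[i + n]] += 1
    return counts

def normalize(counts):
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def backoff_mix(tokens, prompt, weights):
    """Blend predictions from progressively longer suffixes of the prompt.
    weights[k] is the mixing weight of the (k+1)-word context."""
    mixed = Counter()
    for k, w in enumerate(weights):
        context = tuple(prompt[-(k + 1):])
        for word, p in normalize(next_word_counts(tokens, context)).items():
            mixed[word] += w * p
    return dict(mixed)

# 'Dogs' alone predicts eat 50% / Bark 50%; 'My Dogs' predicts Bark 100%.
# With illustrative weights 0.8 (short context) and 0.2 (long context),
# the blend is eat 40% / Bark 60%, matching the example in the post.
print(backoff_mix(corpus, ["My", "Dogs"], [0.8, 0.2]))
```

Real back-off schemes (Katz back-off, PPM) tune these weights from how much evidence each context length has, rather than fixing them by hand.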
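The Semantics trick, transferring predictions between words whose contexts overlap, can also be sketched directly from the cat/dog example. The 0.5 overlap threshold is an assumption for illustration, not a claim about any real system:

```python
from collections import defaultdict

# The post's toy observations: (word, word-that-followed-it).
pairs = [("cat", "ate"), ("cat", "ran"), ("cat", "ran"),
         ("cat", "jumped"), ("cat", "jumped"), ("cat", "licked"),
         ("dog", "ate"), ("dog", "ran"), ("dog", "ran")]

follows = defaultdict(set)
for word, nxt in pairs:
    follows[word].add(nxt)

def overlap(a, b):
    """Fraction of shared next-word contexts (Jaccard similarity)."""
    shared = follows[a] & follows[b]
    return len(shared) / len(follows[a] | follows[b])

# 'dog' shares 'ate' and 'ran' with 'cat' (overlap 0.5), so borrow cat's
# unshared continuations ('jumped', 'licked') as candidate predictions
# for 'dog' -- e.g. 'dog jumped'.
if overlap("cat", "dog") >= 0.5:
    borrowed = follows["cat"] - follows["dog"]
    print(sorted(borrowed))  # -> ['jumped', 'licked']
```

In a real system the borrowed predictions would be weighted down by the similarity score rather than adopted wholesale, but the mechanism is the same: similar contexts license shared predictions.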
Semantics looks at both sides of a word or phrase, and closer items impact its meaning more.

Byte Pair Encoding: Take a look on Wikipedia; it is really simple and can compress a hierarchy too. Basically you just find the most common low-level pair, e.g. 's'+'t', then you find the next higher-level pair made of those, e.g. 'st'+'ar'. It segments text well, showing its building blocks.

More Data: Literally just feeding the hierarchy/heterarchy more data improves its prediction accuracy of which word/building block usually comes next in a sequence. More data alone improves intelligence; it's actually called "gathering intelligence". It does, however, slow down at some point and requires other mechanisms, like the ones above.

I have ~16 of these that all merge data to improve prediction... You merge to e-merge insights. Any AGI will have these...

------------------------------------------
Artificial General Intelligence List: AGI Permalink: https://agi.topicbox.com/groups/agi/T21c073d3fe3faef0-M9a05153b318f3a0b029f98bf Delivery options: https://agi.topicbox.com/groups/agi/subscription