It seems GPT-2/3 must be using several mechanisms like the ones that follow,
otherwise it would have no chance of predicting well:

P.S. Attention Heads aren't listed below, but they're an important one: they
can, for example, predict a last name accurately by looking only at certain
words and ignoring all the others, e.g. "[Jen Cath] is a girl who [has a mom]
named [Tam] **Cath**" ... Tasks = something to do with where it looks, in
which order, manipulation, etc.
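
Here is a toy sketch of that idea in Python (the words, vectors, and weights
are made up for illustration; this is not GPT-2's actual mechanism): a query
for the missing last name puts most of its dot-product weight on the 'Cath'
token and copies it, regardless of the other words.

import math

# Toy dot-product attention: the query for the missing last name attends
# mostly to the 'Cath' key and so copies that token, ignoring the rest.
words = ["Jen", "Cath", "is", "a", "girl", "named", "Tam"]
keys  = [[1, 0], [0, 1], [0.1, 0.1], [0.1, 0.1],
         [0.1, 0.1], [0.2, 0.1], [0.9, 0.1]]
query = [0.0, 1.0]   # roughly: "which token holds the family's last name?"

scores  = [sum(q * k for q, k in zip(query, key)) for key in keys]
exps    = [math.exp(s) for s in scores]
weights = [e / sum(exps) for e in exps]

best = max(range(len(words)), key=lambda i: weights[i])
print(words[best], round(weights[best], 2))   # -> Cath, the most-attended token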

---Compressors/MIXERS---

Syntactics:
Intro: Letters, words, and phrases re-occur in text. AI finds such patterns in
data and **mixes** them. We don't store the same letter or phrase twice; we
just update connection weights to represent frequencies.
Explanation: If our algorithm has only seen "Dogs eat. Cats eat. Cats sleep. My
Dogs Bark." in the past, is prompted with the input "My Dogs", and we pay
Attention to just 'Dogs' and require an exact memory match, then the possible
predicted futures and their probabilities (frequencies) are 'eat' 50% and
'Bark' 50%. If we consider 'My Dogs' instead, we have fewer matching memories
and predict 'Bark' 100%. The matched neuron's parent nodes receive split
energy from the child match.
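
A minimal sketch of that counting in Python (word-level, toy corpus; the
function name is mine, not GPT's internals):

# Count what follows an exact match of the context and turn counts into probabilities.
corpus = "Dogs eat . Cats eat . Cats sleep . My Dogs Bark .".split()

def next_word_probs(context):
    """Frequencies of the word that follows an exact match of `context`."""
    n, counts = len(context), {}
    for i in range(len(corpus) - n):
        if corpus[i:i + n] == context:
            nxt = corpus[i + n]
            counts[nxt] = counts.get(nxt, 0) + 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_probs(["Dogs"]))        # {'eat': 0.5, 'Bark': 0.5}
print(next_word_probs(["My", "Dogs"]))  # {'Bark': 1.0}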

BackOff:
A longer match considers more information but has very little experience
behind it, while a short match has the most experience but little context. A
summed **mix** predicts better: we look in memory at what follows 'Dogs' and
at what follows 'My Dogs' and blend the 2 sets of predictions to get e.g.
'eat' 40% and 'Bark' 60%.
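
Continuing the sketch above, the blend can be a weighted sum of the two
distributions (the weight is a free parameter I picked so the output matches
the 40%/60% figures; in practice it would be tuned):

def mix(dist_short, dist_long, w_long=0.2):
    """Backoff-style interpolation of two next-word distributions."""
    words = set(dist_short) | set(dist_long)
    return {w: (1 - w_long) * dist_short.get(w, 0) + w_long * dist_long.get(w, 0)
            for w in words}

short = {"eat": 0.5, "Bark": 0.5}   # what follows 'Dogs'
long_ = {"Bark": 1.0}               # what follows 'My Dogs'
print(mix(short, long_))            # 'eat' 0.4, 'Bark' 0.6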

Semantics:
If 'cat' and 'dog' share 50% of the same contexts, then maybe the contexts
they don't share are shared as well. So you see cat ate, cat ran, cat ran, cat
jumped, cat jumped, cat licked... and dog ate, dog ran, dog ran. Therefore the
predictions they don't share could probably be shared too, so maybe 'dog
jumped' is a good prediction. This helps prediction a lot: it lets you match a
given prompt to many different memories that are worded similarly. Like the
rest above, you mix these; you need not store every sentence from your
experiences. The result is a fast, low-storage brain. Semantics looks at both
sides of a word or phrase, and closer items impact its meaning more.
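
A rough sketch of that in Python, assuming we just count the predictions seen
after each word and compare the count vectors with cosine similarity (my
choice of measure and threshold, not necessarily what GPT does):

import math

# Predictions observed after 'cat' and 'dog' in the toy data above.
cat = {"ate": 1, "ran": 2, "jumped": 2, "licked": 1}
dog = {"ate": 1, "ran": 2}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

sim = cosine(cat, dog)
print(round(sim, 2))              # ~0.71, so the two words are fairly 'alike'
if sim > 0.5:                     # threshold chosen arbitrarily
    borrowed = {w: c for w, c in cat.items() if w not in dog}
    print(borrowed)               # {'jumped': 2, 'licked': 1} -> maybe 'dog jumped'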

Byte Pair Encoding:
Take a look on Wikipedia; it is really simple and can compress a hierarchy
too. Basically you just find the most common low-level pair, e.g. 'st', then
you find the next higher-level pair made of those, e.g. 'st'+'ar'... it
segments text well, showing its building blocks.
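
A minimal BPE-style sketch in Python (character pairs over a toy string; real
BPE runs over a word-frequency vocabulary, but the merge loop is the same
idea):

from collections import Counter

def bpe_merges(text, n_merges=3):
    """Repeatedly merge the most frequent adjacent pair of symbols,
    building larger and larger chunks (e.g. 's'+'t' -> 'st' -> ... -> 'star')."""
    symbols = list(text)
    merges = []
    for _ in range(n_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0
        while i < len(symbols):   # replace every occurrence of the pair
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return merges, symbols

print(bpe_merges("star start stark star"))
# merges like ['st', 'sta', 'star']; the text segments into its building blocks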

More Data:
Literally just feeding the hierarchy/heterarchy more data improves its
prediction accuracy for which word/building block usually comes next in a
sequence. More data alone improves intelligence; it's actually called
"gathering intelligence". It does, however, slow down at some point and
requires other mechanisms, like the ones above.

I have ~16 of these that all merge data to improve prediction.... You merge to 
e-merge insights

Any AGI will have these....