I've been coding my own text predictors and researching them hard, and I notice the key is frequencies of what usually comes next after your context [th], e.g. you see [e] has a high observed count. But you want the past window to be very long and grab your follower frequencies from those recognized features, e.g. [my dog ate the][k]. That allows good letter prediction. GPT-2 predicts BPE tokens. You actually mix multiple window lengths, because longer context matches are rarer, but you can at least get some frequencies of what has been seen to follow them. To help find longer matches you'd want to "translate" words (cat=dog), accept appearance at different positions, and focus on rare words to summarize the context, so that when you look at the past 30 words you can find multiple matches in memory even though there is of course no experience exactly matching it - the alternative words, positions, and filler content are in there but are similar or don't matter. So in the end, frequencies run it, and even the recognition cat=dog is based on discovered shared contexts, based on frequencies. Probabilities run it, and if a match is not exact then its predictions all get less weight.
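The core idea above (count what follows each context, and mix several window lengths since longer matches are rarer but more specific) can be sketched in a few lines. This is a minimal character-level toy of my own, not GPT-2's actual mechanism; the weighting scheme (2^n per window length) is an illustrative assumption.

```python
from collections import defaultdict, Counter

MAX_N = 5  # longest context window to try

def train(text):
    # counts[n][ctx] = Counter of which character follows context ctx of length n
    counts = {n: defaultdict(Counter) for n in range(1, MAX_N + 1)}
    for n in range(1, MAX_N + 1):
        for i in range(len(text) - n):
            counts[n][text[i:i + n]][text[i + n]] += 1
    return counts

def predict(counts, context):
    # Mix the follower frequencies from every window length that matched.
    # Longer matched contexts are rarer, so trust them more (assumed 2^n weight).
    mixed = Counter()
    for n in range(1, MAX_N + 1):
        followers = counts[n].get(context[-n:])
        if not followers:
            continue  # no experience with this longer context; fall back on shorter ones
        total = sum(followers.values())
        weight = 2 ** n
        for ch, c in followers.items():
            mixed[ch] += weight * c / total
    return mixed.most_common(1)[0][0] if mixed else None
```

Training it on a repetitive string and querying a partial context shows the backoff in action: even when the full 5-character context was never seen, the shorter windows still supply frequencies.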
What I show in the video appears to help prediction by making the predictions more "similar" to the *rare story words (especially more recent words); it can look at ALL the past context. The main prediction in these algorithms, however, comes from looking at the past e.g. 3 or 20 words to get multiple "similar matches" and see what usually follows the matched contexts. You can look farther back if you 1) attend only to rare words and ignore e.g. 'the', 2) can use similar words ('cat'/'dog'), and 3) allow similar positions. When you know "Paris is the capital of France" and see a new prompt "The capital of France is ", you predict Paris with OK accuracy, because the context mostly matches this (and a few other things in your brain), and the 2 words that are switched around exist but in similar positions. A good question is: do story words actually take their own vote on the prediction candidates? Or do we only use context matches to see what comes next? Well, if I keep adding the word 'cat' to the start of my prompt, it makes 'cat' more probable; inch by inch the probability rises, which would be unlikely if matches were finding this is what commonly follows. Below is a new video testing it out, to see if the prediction is influenced by context matches solely, or if it also uses all story words to mindlessly vote on the next word (if the input is all 'cat', it will likely continue saying 'cat' or something similar). https://www.youtube.com/watch?v=kF8U2FD9JXc&feature=youtu.be I could try in Allen these inputs: 'bat' or 'bat bat' or 'bat bat bat', or 'wind' or 'wind wind' or 'wind wind wind'... and no matter the word used, it will predict the same word, with more probability the more times it occurs. The dataset it trained on contains only briefly similar phrases, and I don't think they predict the same word that occurs in them.
Yes, my input matches them more and more because of similar words, and hence the prediction will be similar, but I don't feel that out of 40GB there are enough "matches" to achieve that. Keep in mind it predicts the *same* word; you'd think 'bat bat bat bat bat bat' would match things like 'my bat saw a bird on a bat but bats fly in bat caves' etc. and would often predict only similar words like 'cave' or 'bird'... how many matches could you get that incrementally improve the prediction of 'bat'!? Impossible.
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T1f56a0e2e53cf50a-Mf608913b5a48d9b2ecec47bb
Delivery options: https://agi.topicbox.com/groups/agi/subscription
