GPT-2 also surely uses frequency for next-word probabilities. And probably something equivalent to SSE (Secondary Symbol Estimation), which is used in the best lossless compressors and shaves a 100MB input from about 22MB down to 20MB; the world record is 14.8MB for a 100MB input.
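The SSE idea mentioned above can be sketched roughly like this: a small context-indexed table that maps a model's primary probability to a refined one, learned adaptively (in the style of PAQ's adaptive probability maps). This is a minimal illustrative sketch, not code from any actual compressor; the class name, bucket count, and learning rate are all assumptions.

```python
# Rough sketch of Secondary Symbol Estimation (SSE): refine a primary
# probability p using a per-context table of bucketed, adapted values.
# All names and parameters here are illustrative, not from a real codebase.

class SSE:
    def __init__(self, n_contexts, n_buckets=33, rate=0.02):
        self.n_buckets = n_buckets
        self.rate = rate
        # One row per context; buckets start as the identity mapping,
        # so an untrained SSE returns the input probability unchanged.
        self.table = [[i / (n_buckets - 1) for i in range(n_buckets)]
                      for _ in range(n_contexts)]

    def refine(self, p, ctx):
        """Map primary probability p (0..1) to an adjusted probability."""
        x = p * (self.n_buckets - 1)
        i = min(int(x), self.n_buckets - 2)
        w = x - i  # interpolate between the two neighboring buckets
        row = self.table[ctx]
        return (1 - w) * row[i] + w * row[i + 1]

    def update(self, p, ctx, bit):
        """Nudge the two buckets used for (p, ctx) toward the observed bit."""
        x = p * (self.n_buckets - 1)
        i = min(int(x), self.n_buckets - 2)
        row = self.table[ctx]
        for j in (i, i + 1):
            row[j] += self.rate * (bit - row[j])
```

The point is that the raw model probability is not trusted directly: it is passed through a learned correction table keyed on a little extra context, which is where the compression gain comes from.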
Also, the best-matched strings are mixed: "the dog" activates "the dog", "he dog", "t dog", "this cat", "cat this", "but cats", and so on. They are similar, some are longer, and some are more activated, especially the longer ones. So the predictions are voted on based on frequency and combined energy transfer, especially from longer and more activated strings (and, one day, reward too). And if the frequency and the total count of diverse predictions are too low, it backs off one string length to get higher frequency, so the shorter matches may get the 40% needed where the first mixed set got only 10%; you mix these sets of prediction probabilities together.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T0607d3f3f3678b2f-Mebcf09a30e09d161fc96c751
Delivery options: https://agi.topicbox.com/groups/agi/subscription
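The backoff-and-mix scheme described above can be sketched as follows: look up next-character counts for progressively shorter context suffixes, blend the resulting distributions, and give longer matches more weight. This is only an illustrative sketch of the idea; the function name, the halving weight for shorter matches, and the count threshold are all assumptions, not a specified algorithm.

```python
from collections import Counter

def backoff_predict(text, context, max_len=5, min_count=3):
    """Blend next-char distributions from suffixes of `context`.

    Starts with the longest suffix; if its total match count is too
    low, backs off to a shorter suffix (at reduced weight) and mixes
    that distribution in as well.
    """
    mixed = Counter()
    weight = 1.0
    for n in range(min(max_len, len(context)), 0, -1):
        suffix = context[-n:]
        counts = Counter()
        start = text.find(suffix)
        while start != -1:
            nxt = start + len(suffix)
            if nxt < len(text):
                counts[text[nxt]] += 1
            start = text.find(suffix, start + 1)
        total = sum(counts.values())
        if total > 0:
            for ch, c in counts.items():
                mixed[ch] += weight * c / total
            if total >= min_count:
                break  # enough evidence at this length; stop backing off
        weight *= 0.5  # shorter, weaker matches get less say in the vote
    s = sum(mixed.values())
    return {ch: v / s for ch, v in mixed.items()} if s else {}
```

For example, with `text = "abcabcabd"` and context `"ab"`, the matches vote 2-to-1 for `"c"` over `"d"`, and the returned distribution reflects exactly those frequencies.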
