@Matt and others, is this correct?
After analyzing http://mattmahoney.net/dc/dce.html#Section_58 and other areas on the site, and looking at the code, I have some new understanding. The code the participants write is tiny (~0.5MB, sometimes even ~0.04KB) compared to the 100MB they compress. The algorithm eats the 100MB and spits out ~15MB. Most of that 15MB is error-correction padding (via Arithmetic Coding) that swerves the predicted probabilities the right way so the original data comes back losslessly, while the small remainder is the model/weights itself (Arithmetic Coded too, I suppose), a 'really good predictor' that is also 'really small'.

Here's why. Order-0 alone already helps a lot (just the frequencies of single letters in the 100MB, e.g. a=54745747, b=67457777, etc.), order-1 n-grams are much better (given the past context [h], 'e' is most likely), and each higher order gives exponentially diminishing returns. Eventually order-100 contributes only sparse tidbits from far back, and here the code gets larger, though maybe those are hand-crafted rules I'm seeing. One might think you're done after just 2-grams plus error padding, since that already compresses a lot (e.g. to 19MB), but that's far from correct: shrinking the 100MB to 30MB is easy, and Arithmetic Coding alone can get it to 50MB or 35MB. The last stretch (15MB down to 10MB) is really hard and buys only a small increase in compression, yet it is really important for identifying intelligence. Therefore the Hutter Prize should be very interested in small improvements as the size gets lower.

The error padding uses Arithmetic Coding, which solves the coding problem (the best, smallest language to talk in) and is optimal for swerving the predictions the right way, because the model's predictions are mostly correct on average, as said, and only need a little push. The model looks at order-n n-gram frequencies/probabilities (the short-term nearby past context) and other contexts, like order-0, order-5, order-7, order-3.6.1 (sparse, e.g.
"the [cat] ate [t]he [new] fo[od on the] _"). The shorter orders act more like long-term learning. The model overlays these contexts and averages their predictions; this solves the probability problem. Next, probability for similar words/codes and positions is subtracted from these estimates; this solves the similarity problem. And lastly, the next letter/bit is also predicted from its relation to rare story contexts. Before, I thought one context is looked at and then frequency adds probability to the next bit/letter/code to predict. But we actually look at many contexts/orders (to get multiple probabilities), all overlaying each other, and average them, so maybe we could store them already averaged instead. So maybe some of the steps I mentioned are the same thing.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T65747f0622d5047f-M59ccd26ae3553b77991f1b23
Delivery options: https://agi.topicbox.com/groups/agi/subscription
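P.S. The order-0 claim (that single-letter frequencies alone already buy a lot of compression) is easy to sanity-check. This is just my own toy Python sketch, not any contestant's code: a static order-0 model fed to an arithmetic coder approaches the Shannon entropy of the symbol distribution, so computing that entropy shows the size such a coder would reach.

```python
from collections import Counter
import math

def order0_bound_bytes(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bytes: the size a
    static order-0 model + arithmetic coder approaches (ignoring the
    cost of transmitting the frequency table itself)."""
    counts = Counter(data)
    n = len(data)
    bits = -sum(c * math.log2(c / n) for c in counts.values())
    return bits / 8

text = b"the cat ate the new food on the mat"
print(len(text), "->", round(order0_bound_bytes(text), 1))
```

Even on this tiny sample the bound comes out well under the raw size, and on 100MB of English text the gap is much bigger, which matches the "order-0 alone already helps a lot" point.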
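P.P.S. The "overlay several orders and average their predictions" idea can be sketched too. Again this is only my illustration with uniform averaging; real context-mixing compressors (e.g. Matt's PAQ family) weight the per-order models adaptively instead of averaging them equally.

```python
from collections import defaultdict, Counter

class MixedNGramPredictor:
    """Toy context-mixing predictor: averages the next-character
    distributions of several order-k models (k = context length)."""

    def __init__(self, orders=(0, 1, 2)):
        self.orders = orders
        # one table per order: context string -> Counter of next chars
        self.tables = {k: defaultdict(Counter) for k in orders}

    def train(self, text):
        for i, ch in enumerate(text):
            for k in self.orders:
                if i >= k:
                    self.tables[k][text[i - k:i]][ch] += 1

    def predict(self, context):
        """Overlay every order that has seen this context and average."""
        mixed, used = Counter(), 0
        for k in self.orders:
            if len(context) < k:
                continue
            table = self.tables[k].get(context[len(context) - k:])
            if table:
                total = sum(table.values())
                for ch, c in table.items():
                    mixed[ch] += c / total  # normalized per-order estimate
                used += 1
        return {ch: p / used for ch, p in mixed.items()} if used else {}

p = MixedNGramPredictor()
p.train("the cat ate the new food")
probs = p.predict("th")
print(max(probs, key=probs.get))  # orders 1 and 2 both vote for 'e'
```

Feeding the averaged distribution to an arithmetic coder is exactly the "little push" step: the mix only has to be mostly right for the coder to spend very few bits per symbol.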
