@Matt and others, is this correct?

After analyzing http://mattmahoney.net/dc/dce.html#Section_58 and other parts
of the site, and looking at the code etc., I have some new understanding. The
decompressors the participants wrote are small (~0.5MB, or even 0.04KB
sometimes) compared to the 100MB they compress. The algorithm eats the 100MB and then spits out ~15MB.
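If I understand the arithmetic-coding side right, the compressed size is essentially the model's total code length: each symbol costs about -log2(p) bits under the model's predicted probability, so a better predictor directly means a smaller file. Here is a toy sketch of my own (the function name and the order-0 model are my assumptions, not anything from Matt's code):

```python
import math
from collections import Counter

def ideal_code_length_bits(text):
    """Ideal arithmetic-coded size of `text` under a toy order-0 model:
    each symbol costs -log2 p(symbol) bits, where p comes from the
    single-letter frequencies of the text itself (a stand-in for enwik)."""
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[c] / total) for c in text)

sample = "the cat ate the new food on the mat " * 100
bits = ideal_code_length_bits(sample)
# The estimate comes out well under 8 bits/char, i.e. smaller than the raw bytes.
print(f"{len(sample)} bytes -> ~{bits / 8:.0f} bytes under order-0")
```

Swapping in a better model (higher orders, mixing) only changes the probabilities fed into the same -log2 sum, which is why the whole game reduces to prediction quality.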
Most of the 15MB is, in effect, correction data (produced by Arithmetic Coding)
that swerves the predicted probabilities the right way so the exact data comes
back losslessly, while the little that remains of the compressed data is the
model/weights themselves (Arithmetic Coded too, I suppose): a 'really good
predictor' that is also 'really small'.

Here's why. Just looking at order-0 really helps (just the frequency of single
letters in the 100MB, e.g. a=54745747, b=67457777, etc.), order-1 n-grams are a
lot better (given the past context [h], 'e' is most likely), and so on with
exponentially diminishing returns. Eventually order-100 has sparse tidbits that
are useful but far back, and here the code gets larger, though maybe that's
hand-crafted rules I'm seeing.

One might think you're done after looking at just 2-grams and adding the
correction data, because it's compressed so much already, e.g. to 19MB, but
that's far from correct. It's easy to shrink the 100MB to 30MB; Arithmetic
Coding alone *can* get it to 50MB or 35MB, and that's super easy. This last
stretch (15MB down to 10MB) is really hard for a small increase in compression,
yet really important for identifying intelligence. Therefore the Hutter Prize
should be very interested in small improvements as the number gets lower.

The correction data uses Arithmetic Coding, which solves the coding problem
(the best, smallest language to talk in) and is optimal for swerving the
predictions the right way, because the model's predictions are mostly correct
on average, as said, and only need a little push. The model looks at order-n
n-gram frequencies/probabilities (the short-term nearby past context) and other
contexts, like order-0, order-5, order-7, and sparse ones like order-3.6.1
(e.g. "the [cat] ate [t]he [new] fo[od on the] _"). The shorter orders are more
Long Term Learning. It overlays these contexts and averages the predictions.
This solves the probability problem. Next, for similar words/codes and
positions, something is subtracted from these probabilities; this solves the
similarity problem. And lastly, the Next Letter/Bit is predicted based also on
its relation to rare story contexts.

Before, I thought one context is looked at, then frequency adds probability to
the Next Bit/Letter/Code to predict. But since we look at many contexts/orders
(to get multiple probabilities), all overlaying each other, and average them,
maybe we can store them already averaged. So maybe some of the steps I
mentioned are the same thing.
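The overlay-and-average idea can be sketched like this. This is a toy of my own, assuming plain averaging of order-0 through order-2 character models (real compressors like PAQ use learned, weighted mixing rather than a flat average, so the class and names here are illustrative assumptions):

```python
from collections import Counter, defaultdict

class MixedModel:
    """Toy context-mixing predictor: average the probability
    distributions from order-0, order-1, and order-2 models."""
    def __init__(self, text):
        self.orders = []
        for n in range(3):  # order-0, order-1, order-2
            counts = defaultdict(Counter)
            for i in range(n, len(text)):
                counts[text[i - n:i]][text[i]] += 1  # context -> next-char counts
            self.orders.append(counts)

    def predict(self, context):
        """Overlay the contexts: average each order's prediction for
        the next character given the (up to 2-char) past context."""
        alphabet = set().union(*(c for o in self.orders for c in o.values()))
        mixed = {}
        for ch in alphabet:
            probs = []
            for n, counts in enumerate(self.orders):
                ctx = context[len(context) - n:] if n else ""
                c = counts.get(ctx)
                if c and sum(c.values()):
                    probs.append(c[ch] / sum(c.values()))
            mixed[ch] = sum(probs) / len(probs) if probs else 0.0
        return mixed

m = MixedModel("the cat ate the food, the cat ate the food")
p = m.predict("th")
# The order-1 [h] and order-2 [th] contexts both pull strongly toward 'e',
# outvoting order-0 (which alone would favor the most frequent character).
print(max(p, key=p.get))
```

This also shows why the shorter orders act like long-term learning: order-0 and order-1 statistics are stable across the whole text, while the higher orders supply the sharp, context-specific pushes.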
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T65747f0622d5047f-M59ccd26ae3553b77991f1b23