You are describing PPM. Here are results from two other PPM
compressors on enwik7 (10 MB).

ppmd e -o16 -m256 enwik7 -> 2289369, 3.6 seconds
ppmonstr e -o16 -m256 enwik7 -> 2101368, 14.4 seconds

-o16 means use an order 16 context tree to gather statistics (like
yours). -m256 limits memory usage to 256 MB.

PPM predicts one byte at a time and codes the result using arithmetic
coding. It starts with the longest context (up to 16) that has at least
one match and falls back to shorter contexts until it has a
distribution over all 256 possible next bytes. The main difficulty is
the zero frequency problem: how much probability space to allocate to
the next lower level for all the byte values not predicted at the
current level. For example, if the current context has 5 matches with
A, A, A, B, C as the next bytes, then you assign A three times the
probability of B or C. But how much for the rest of the alphabet? ppmd
and ppmonstr use an adaptive approach that learns this value from
various small contexts. Details are in my book:
http://mattmahoney.net/dc/dce.html#Section_422
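
For concreteness, here is a rough Python sketch of that fallback scheme
using the simple PPMC rule (escape probability proportional to the
number of distinct bytes seen in the context). It is only to illustrate
the idea, not what ppmd actually does, since ppmd adapts the escape
probability; the class and method names are mine.

from collections import defaultdict, Counter

class SimplePPM:
    def __init__(self, max_order=16):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context bytes -> next-byte counts

    def predict(self, history):
        """Probability of each of the 256 byte values after 'history'."""
        probs = [0.0] * 256
        remaining = 1.0      # mass not yet assigned by longer contexts
        excluded = set()     # bytes already predicted by a longer context
        for order in range(min(self.max_order, len(history)), -1, -1):
            ctx = bytes(history[len(history) - order:])
            counts = self.counts.get(ctx)
            if not counts:
                continue
            seen = {b: n for b, n in counts.items() if b not in excluded}
            if not seen:
                continue
            total = sum(seen.values())
            escape = len(seen) / (total + len(seen))  # zero-frequency estimate
            for b, n in seen.items():
                probs[b] += remaining * n / (total + len(seen))
            excluded.update(seen)
            remaining *= escape
        unseen = [b for b in range(256) if b not in excluded]
        if unseen:           # order -1: uniform over bytes never seen
            for b in unseen:
                probs[b] += remaining / len(unseen)
        return probs

    def update(self, history, nxt):
        for order in range(min(self.max_order, len(history)) + 1):
            self.counts[bytes(history[len(history) - order:])][nxt] += 1

For the A, A, A, B, C example above, the longest context gives A
probability 3/8, B and C 1/8 each, and escapes with probability 3/8 to
the next shorter context.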

I also suggest learning C and C++. Python is interpreted, so it is very slow.

On Sun, Apr 5, 2020 at 10:14 PM <[email protected]> wrote:
>
> Is it ok if I explain it as badly as the expert on this page?:
> https://encode.su/threads/3303-Is-this-how-to-practically-implement-Arithmetic-Coding/page7
>
> Unfortunately there are very few clear, complete explanations of these
> incredible algorithms. I could make a better compressor if someone
> intuitively explained Green, SSE, and SEE.
>
> My algorithm steps a 17-letter window along the input file one letter
> at a time, updating a tree as it sees new data. The tree's branches are
> 17 nodes long, and a node's count is updated whenever a branch passes
> through it. For each step the window takes, the algorithm runs 17
> searches in the tree, each one letter longer than the last. The child
> leaves (the final letter of a searched branch) are the predictions,
> with the counts seen so far in the file. Layer-1 nodes are children too
> and need no match. The tree stores the frequency of every 1/2/3/.../17
> letter string seen so far. The children are what allow you to
> predict/compress the next letter accurately.
>
> These 17 sets of predictions must be mixed, because while the longest
> set is more accurate, it has fewer statistics, sometimes only 2 counts.
> We start with the longest match found, e.g. a 14-letter match in the
> tree. The 14th set of predictions may say it has seen, coming next,
> a=44, b=33, f=25, w=7. I sum a set's counts to get a total (109 in this
> case), then divide each count by the total to get probabilities that
> add up to 1, e.g. 0.404, 0.303, ....
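
As I read it, the tree amounts to a map from each context of length 0
to 16 (plus the predicted letter, giving the 17-node branches) to
counts of the next letter. A rough Python sketch of gathering and
normalizing the 17 prediction sets, with my own names, might be:

from collections import defaultdict, Counter

# counts[k] maps a context of length k (the last k letters) to a Counter
# of the letters that followed it.
counts = [defaultdict(Counter) for _ in range(17)]

def update(history, nxt):
    # Record 'nxt' under every context suffix of length 0..16.
    for k in range(min(16, len(history)) + 1):
        counts[k][history[len(history) - k:]][nxt] += 1

def prediction_sets(history):
    # One normalized prediction set per matching context length,
    # longest context first, as in the description above.
    sets = []
    for k in range(min(16, len(history)), -1, -1):
        c = counts[k].get(history[len(history) - k:])
        if c:
            total = sum(c.values())
            sets.append((k, total, {ch: n / total for ch, n in c.items()}))
    return sets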
> Now for all these predicted probabilities, we still have 13 more sets
> to mix in, so some share must be taken from each. What I do is check
> the set's total count against a wanted "Roof", e.g. 109 against 300
> (maybe we don't even need to mix in lower sets if we already have
> enough statistics). So in this case I scale this set's predictions to
> roughly 1/3 of their value, and we still desire about 66% more
> statistics. For the next set, if say we have 200 against 300, I take
> away 2/3 of that 66%, meaning we still desire 22% (not 66% - 2/3 = 0%):
> I take away the fraction obtained OF the fraction still desired. A
> little bit of the lower sets therefore always leaks in, which is better
> because we can never be sure, even if we surpass the Roof by a lot.
> Besides, it gave better results.
>
> The Roof is decided by how many symbols are predicted in the set (the
> number of unique symbols), so if I have 2 then the Roof may be 8 counts
> wanted. Also, while the Roof is based on how many different symbols are
> seen in the set, we get a slightly different Roof depending on which
> set we are on: if we have 4 letters in set #14 the Roof might be 33,
> but in set #5 it might be 26. Also, based on the Roof's size, a curve's
> bend is modified. This curve gives small/large total counts in a set an
> even smaller/larger total (it isn't used in the Arithmetic Coding, only
> for deciding how much share this set gets in our mixer). It is meant to
> be an exponential activation. Finally, a global weight is given to each
> set, e.g. the 14th set always gets only 0.7 of the weight it was going
> to get. I hardcoded the numbers for now, but the code isn't grossly
> large of course. If they were adaptive and based on the data, the
> compression would be even better. I just noticed I do exit the mixing
> before reaching the lower sets if the Roof is ever surpassed; I'll have
> to test whether this is useful.
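
If I follow the Roof scheme correctly, the mixing amounts to something
like the sketch below. wanted_roof() and GLOBAL_WEIGHT stand in for the
hardcoded numbers mentioned above, and the "curve" on the totals is
omitted; this is my reconstruction, not the actual code.

from collections import Counter

GLOBAL_WEIGHT = {k: 1.0 for k in range(17)}   # placeholder per-set weights
GLOBAL_WEIGHT[14] = 0.7                       # e.g. the value quoted above

def wanted_roof(num_symbols, k):
    # Placeholder: the Roof grows with the number of distinct symbols
    # predicted and also depends on which set we are in; the real numbers
    # were hand-tuned and aren't given.
    return 4 * num_symbols + k

def mix(sets):
    # sets: [(k, total, dist), ...] from prediction_sets(), longest first.
    mixed = Counter()
    desire = 1.0                          # share still wanted from lower sets
    for k, total, dist in sets:
        got = min(total / wanted_roof(len(dist), k), 0.98)
        # capped below 1 so lower sets always leak in a little
        share = desire * got * GLOBAL_WEIGHT[k]
        for ch, p in dist.items():
            mixed[ch] += share * p
        desire *= 1.0 - got
    z = sum(mixed.values())               # renormalize for the arithmetic coder
    return {ch: p / z for ch, p in mixed.items()}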
> The Arithmetic Coder takes the combined sets, i.e. the prediction
> probabilities are combined a, b, c + a, b, c + a, b, c ... = a, b, c
> (and now all the predictions add up to 1). The AC then takes a high
> and low bound of 1 and 0, takes the middle between the high and low,
> and starts stepping through each prediction's share of the range until
> it matches the final letter in the window (the same process whether
> compressing or decompressing). So say we stop once we reach b in our
> set, e.g. a, [b], c: we are now down to a float range of e.g.
> 0.22-0.45. We take the middle again (0.23) and continue once the
> window on the file takes another step. The encoded decimal keeps
> getting more precise, storing the whole file. To work in 16-byte
> floats we need to carry away locked digits, meaning if the high and
> low are now 0.457594-0.458988, we store '45' and get 0.7594-0.8988,
> and we then keep narrowing these two to make the decimal more precise.
> This long decimal is then stored as a binary number, e.g.
> 6456453634636 = 10100011100111010011.
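
For reference, the textbook arithmetic-coding update narrows the
[low, high) interval by the chosen symbol's slice of the distribution
(rather than by taking midpoints) and emits leading digits as soon as
high and low agree on them, which is the digit-carrying described
above. A minimal decimal-digit sketch, with my own names:

def encode(message, dist_for_step):
    # dist_for_step(i) must return the mixed distribution (summing to 1)
    # for position i, and the decoder must reproduce it exactly.
    # Emits decimal digits for readability; real coders use binary ranges.
    low, high, out = 0.0, 1.0, []
    for i, sym in enumerate(message):
        cum = 0.0
        for ch, p in sorted(dist_for_step(i).items()):
            if ch == sym:
                low, high = (low + (high - low) * cum,
                             low + (high - low) * (cum + p))
                break
            cum += p
        # carry away locked digits: while low and high agree on the
        # leading digit, emit it and rescale, so precision never runs out
        while int(low * 10) == int(high * 10):
            d = int(low * 10)
            out.append(str(d))
            low, high = low * 10 - d, high * 10 - d
    out.append(format((low + high) / 2, ".12f")[2:])  # any value inside
    return "".join(out)                               # the final range

With the 0.457594-0.458988 example above, the while loop emits '45' and
continues with 0.7594-0.8988.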
> I didn't implement removing, from the lower sets, the counts that come
> from the higher set, because it hurt compression; i.e. if there are 9
> counts total in set 3 and 99 total in set 2, then 9 of the counts in
> set 2 are the same observations and 'should' not help us reach the
> Roof. I'll look into it more.
>
> Lastly, escape letters: the first set we mix in is a dummy set that
> has a very small weight and contains every possible letter, in case we
> need to encode/decode one that hasn't yet been seen in the file; it
> therefore needs a small share of the AC high-low bounds. I also
> hardcoded each probability in this dummy set; common letters get more
> weight.
>
> Compression/decompression takes 2 hours and 16 minutes for 10 MB, but
> Python is slow. RAM use is fairly high because I didn't implement the
> pruning.
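
The dummy escape set can be as simple as blending a fixed low-weight
prior over all possible letters into the mixed predictions, so every
letter keeps a nonzero slice of the AC range. The uniform prior and the
names here are placeholders; the post says its probabilities were
hand-tuned, with common letters weighted higher.

EPS = 1e-4                                     # weight of the dummy escape set
PRIOR = {chr(i): 1 / 256 for i in range(256)}  # uniform placeholder

def with_escape(mixed):
    # Blend the mixed predictions with the low-weight prior so every
    # possible letter gets a nonzero probability for the coder.
    return {ch: (1 - EPS) * mixed.get(ch, 0.0) + EPS * p
            for ch, p in PRIOR.items()}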



-- 
-- Matt Mahoney, [email protected]
