Feel free to download ZPAQ and do your own experiments. A simple order 3 CM
looks like "c100.0.255.255.255". 100 is the max count / 4, or 0 for an ICM.
Smaller numbers (1-255) adapt faster. The 0 means no special context. The
255s are bit masks on the last 3 bytes of context. The complete option
would be -ms4.0c100.0.255.255.255. -m means method. s means streaming
format. 4 means 2^4 MB blocks compressed independently in separate threads.
0 means no preprocessing. To mix order 2 and 3 use
-ms4.0c100.0.255.255c100.0.255.255.255m with a final mixer.
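
As a rough illustration of how the option strings above are put together, here is a small Python sketch. The helper names (cm_component, method) are hypothetical and are not part of ZPAQ. It only assembles the text of the option from the pieces described above and does not call the compressor.

```python
def cm_component(limit, special, masks):
    """Build a 'c' component string.

    limit   -- max count / 4 for a CM, or 0 for an ICM
    special -- 0 means no special context
    masks   -- one bit mask per context byte (255 = use the whole byte)
    """
    return "c" + ".".join(str(n) for n in (limit, special, *masks))


def method(block_log2_mb, preprocess, components, mixer=False):
    """Assemble a streaming ('s') method option from its parts."""
    body = "".join(components) + ("m" if mixer else "")
    return f"-ms{block_log2_mb}.{preprocess}{body}"


order2 = cm_component(100, 0, [255, 255])
order3 = cm_component(100, 0, [255, 255, 255])

single = method(4, 0, [order3])                    # order 3 CM alone
mixed = method(4, 0, [order2, order3], mixer=True)  # orders 2 and 3, mixed
block_mb = 2 ** 4                                   # block size in MB for "4"

print(single)
print(mixed)
```

The two printed strings are the same options quoted above, so the sketch is mainly a way to see which field is which before changing them by hand.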

I did not implement a semantic model. That would be a short term memory of
a list of the last several tokens with low frequency tokens persisting
longer. These would be individual contexts to predict the current token. I
have no doubt it would improve compression but I didn't write the code. If
you want an order 0-1 ICM-ISSE word chain, the option is w1.65.26.223, where
1 is the chain length, 65 is the ASCII code of the start of the alphabet
(A), 26 is the alphabet size, and 223 is the bit mask used on the letters
before hashing, which has the effect of converting to upper case. Any other
bytes start a new word. Orders higher than 1 don't help much.
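
To make the masking step concrete, here is a minimal Python sketch of how a word context could be tracked. It assumes the mask is 223 (0xDF), which clears the 0x20 case bit so that a-z land on A-Z (65..90). The function names are made up for illustration, and the sketch keeps the folded letters as bytes instead of hashing them the way ZPAQ does.

```python
MASK = 223   # 0xDF: clears bit 0x20, folding 'a'..'z' onto 'A'..'Z'
START = 65   # ASCII 'A', first symbol of the word alphabet
SIZE = 26    # alphabet size


def fold(byte):
    """Return the masked byte if it falls in the word alphabet, else None."""
    b = byte & MASK
    return b if START <= b < START + SIZE else None


def word_contexts(data):
    """Yield (byte, word_so_far) pairs.

    word_so_far is the case-folded partial word seen before this byte.
    Any byte outside the alphabet resets it, i.e. starts a new word.
    """
    word = b""
    for byte in data:
        yield byte, word
        folded = fold(byte)
        word = word + bytes([folded]) if folded is not None else b""


contexts = [w for _, w in word_contexts(b"The cat")]
print(contexts)
```

For the input "The cat" the context seen at the space is THE, and it is empty again when the "c" arrives, which is the "any other bytes start a new word" rule.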

-- Matt Mahoney, [email protected]

On Fri, Dec 19, 2025, 10:22 PM <[email protected]> wrote:

> Here's the last of my questions/replies for now:
>
> There's a not-so-obvious very important thing I learnt to make gaps work
> this good (in my code, see the list called "hasonce", and re-read my
> encode.su new description).
>
> Btw does your algorithm skip spaces only? Or only half byte gaps? If yes,
> why only half byte gaps? Or is it no gaps at all ?...
>
> And let me clarify my other question (you didn't answer). Do you do the
> following? It seems like it might help. Let's say we have "[The cat] was on
> the floor, I got up and [heard a] ____". (meow is predicted by [I heard a],
> but [The cat] at the start also might be able to help because it predicts
> meow and could help direct the current window more to lean towards
> predicting "meow").
>
> Also I don't think you ran what I wanted to see, what is your result for
> the following:
> "2024pre-processed enwik5.txt"
> no ICM / SSE / ISSE
> only 1 order (the last 1 byte (or 2 if you must, but it can't be the last
> bit because it won't be similar to my setup's results))

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-M685909a92f177aa021121095
Delivery options: https://agi.topicbox.com/groups/agi/subscription
