Feel free to download ZPAQ and run your own experiments. A simple order-3 CM looks like "c100.0.255.255.255". The 100 is the maximum count divided by 4, or 0 for an ICM; smaller values (1-255) adapt faster. The 0 means no special context. The three 255s are bit masks on the last 3 bytes of context. The complete option would be -ms4.0c100.0.255.255.255, where -m means method, s means streaming format, 4 means 2^4 MB blocks compressed independently in separate threads, and 0 means no preprocessing. To mix orders 2 and 3 with a final mixer, use -ms4.0c100.0.255.255c100.0.255.255.255m.
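To make the layout of the method string concrete, here is a minimal Python sketch that splits one into the fields described above. It is not ZPAQ's own parser: the function names parse_method and describe_cm are hypothetical, and the interpretation (block size as 2^N MB, first CM number as max count / 4, one mask per context byte) is taken only from the description in this message.

```python
import re

def parse_method(method):
    """Split a method string (the part after -m) into its fields.

    Illustrative only; follows the description in this message,
    not the real ZPAQ option parser.
    """
    head = re.match(r"([a-z])(\d+)\.(\d+)", method)
    if head is None:
        raise ValueError("expected <format><block>.<preprocessing> at the start")
    fmt, block_exp, prep = head.group(1), int(head.group(2)), int(head.group(3))
    components = []
    # Each component is one letter followed by optional dot-separated numbers.
    for letter, args in re.findall(r"([a-z])((?:\d+(?:\.\d+)*)?)", method[head.end():]):
        nums = [int(x) for x in args.split(".")] if args else []
        components.append((letter, nums))
    return {
        "format": fmt,                # 's' = streaming
        "block_mb": 2 ** block_exp,   # 4 -> 2^4 = 16 MB blocks
        "preprocessing": prep,        # 0 = none
        "components": components,
    }

def describe_cm(nums):
    """Describe the numbers of a 'c' component: limit, special context, byte masks."""
    limit, special, *masks = nums
    # Assumption from the message: first number is max count / 4, 0 selects an ICM.
    kind = "ICM" if limit == 0 else f"CM with max count {limit * 4}"
    # Assumption: every mask is nonzero, so the order equals the number of masks.
    return f"order-{len(masks)} {kind}"

if __name__ == "__main__":
    parsed = parse_method("s4.0c100.0.255.255c100.0.255.255.255m")
    print(parsed["block_mb"], "MB blocks")
    for letter, nums in parsed["components"]:
        print(letter, describe_cm(nums) if letter == "c" else "final mixer")
```

Running it on the order 2 + order 3 example lists two CM components followed by the mixer, which is a quick way to check a hand-written option before passing it to zpaq.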
I did not implement a semantic model. That would be a short-term memory holding a list of the last several tokens, with low-frequency tokens persisting longer. Each of these would serve as an individual context to predict the current token. I have no doubt it would improve compression, but I didn't write the code. If you want an order 0-1 ICM-ISSE word chain, the option is w1.65.26.223, where 1 is the chain length, 65 is the ASCII code of the start of the alphabet (A), 26 is the alphabet size, and 223 is the bit mask applied to the letters before hashing, which clears the case bit and so converts them to upper case. Any other byte starts a new word. Orders higher than 1 don't help much.

-- Matt Mahoney, [email protected]

On Fri, Dec 19, 2025, 10:22 PM <[email protected]> wrote:

> Here's the last of my questions/replies for now:
>
> There's a not-so-obvious very important thing I learnt to make gaps work this good (in my code, see the list called "hasonce", and re-read my encode.su new description).
>
> Btw does your algorithm skip spaces only? Or only half byte gaps? If yes, why only half byte gaps? Or is it no gaps at all ?...
>
> And let me clarify my other question (you didn't answer). Do you do the following? It seems like it might help. Let's say we have "[The cat] was on the floor, I got up and [heard a] ____". (meow is predicted by [I heard a], but [The cat] at the start also might be able to help because it predicts meow and could help direct the current window more to lean towards predicting "meow").
>
> Also I don't think you ran what I wanted to see, what is your result for the following:
> "2024pre-processed enwik5.txt"
> no ICM / SSE / ISSE
> only 1 order (the last 1 byte (or 2 if you must, but it can't be the last bit because it won't be similar to my setup's results))

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tf0bedfcd44454678-M685909a92f177aa021121095
Delivery options: https://agi.topicbox.com/groups/agi/subscription
