I released another update to my Hutter prize entry. https://encode.su/threads/4467-enwik9-preprocessor#post87076
I kept the same 4 preprocessing steps as before and added a general purpose context mixing compressor: an order 0-1-2-3-4-6 ICM-ISSE chain, a match model, and a neural network mixer. It uses components from libzpaq, but discards the ZPAQL interpreter/compiler, instead computing contexts in C++ with a hard-coded model. It can still be used as a preprocessor for other compressors as before, or as a general purpose compressor by skipping the preprocessing steps, although the last two steps should be useful for other text files. It compresses enwik9 to 145 MB in 14 minutes using 2.8 GB of memory, which is near the Pareto frontier on the Large Text Benchmark. I modified the match model to use the uncompressed string as a buffer instead of making a copy like ZPAQ does, saving 0.5 GB. The program will serve as a framework for some experiments I have planned to improve speed and memory use in the context models.

I believe it is possible to code an adult-level language model on a PC within the Hutter prize limits. The human brain has a long-term memory capacity of about 10^9 bits, the same as the compressed size of enwik9 and 1% of the 10 GB memory limit. The model has to run 10,000 times faster than a human, compressing 20 years of learning into about a day. This is doable on a neural network with 10^9 parameters because the information rate is only 5 bits per token over 200M tokens, which means only a small fraction of the parameters need to be updated per token. Assuming a 50K-word vocabulary and an 8-token short-term memory, 400K parameters need to be read and updated per token, a total of 160 trillion operations, or about 2 billion operations per second. Spreading 5 bits over 400K parameters implies a learning rate on the order of the inverse square root of the parameter count, or about 0.1% per parameter, so 16-bit precision should be sufficient.
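As a sanity check, the arithmetic above can be worked through in a few lines of Python. All the inputs here are the assumptions stated in the post (token count, vocabulary size, context length, one day of training), not measured values:

```python
import math

tokens = 200_000_000     # enwik9 tokenized into words (assumed)
vocab = 50_000           # assumed word vocabulary
context = 8              # assumed short-term memory, in tokens
seconds = 86_400         # about one day of training

params_per_token = vocab * context        # parameters touched per token
ops = 2 * params_per_token * tokens       # one read + one update each
ops_per_second = ops / seconds
lr = 1 / math.sqrt(params_per_token)      # inverse-square-root learning rate
matrix_bytes = 10**9 * 2                  # 10^9 weights at 16 bits = 2 GB

print(params_per_token)      # 400000
print(ops)                   # 160000000000000, i.e. 160 trillion
print(f"{ops_per_second:.3g}")  # 1.85e+09, about 2 billion per second
print(f"{lr:.4f}")           # 0.0016, on the order of 0.1% per parameter
```

A per-parameter update of roughly 0.1% is also consistent with 16-bit precision, since a 16-bit weight has a step size of about 2^-16 ≈ 0.0015%, comfortably below the update size.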
This would require 2 GB of memory each for the syntactic and semantic matrices, plus some intermediate neurons representing parts of speech to model relations like dog = animal = noun, so that a context like "the dog" can be used to predict a sequence like "the cat". In my earlier experiments with ZPAQ, the minimum precision seems to be about 12 bits for activation levels or probabilities and 20 bits for weights before compression starts to deteriorate.

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tefdd3e588dd95259-M7e2ee7b986b0594be98011e2
