So I have more realistic plans for my Hutter prize entry.

1. Finish step C to model open and close quotes as separate symbols, because
they have different semantics and different rules for spacing. Likewise for
'' and ''' for italics and bold. Use a common symbol for the closing quotes
or brackets of [[wiki links]], {info boxes}, or ==subheadings==, because
they all have the same meaning: return to normal text.

2. Extend step D to 1-2 byte dictionary codes, probably a 32K or 45K token
vocabulary, so the language model will be 1-2 GB each for the next-token and
semantic matrices. If matrix A maps tokens to next tokens, then A^t maps
tokens to previous tokens, and AA^t + A^tA maps words to grammatically
similar words, like Monday to Tuesday or Mother to Father. The dictionary
can be organized this way (as in the current Hutter entries) to allow
partial token contexts.
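As a toy illustration of why A^t and AA^t + A^tA are useful, here is a sketch
with a made-up 4-token bigram matrix (the tokens and counts are invented for
the example, not taken from the actual dictionary):

```python
import numpy as np

tokens = ["on", "Monday", "Tuesday", "morning"]
idx = {t: i for i, t in enumerate(tokens)}

# A[i][j] = count of token j following token i, from a toy corpus:
# "on Monday morning", "on Tuesday morning"
A = np.zeros((4, 4))
for a, b in [("on", "Monday"), ("Monday", "morning"),
             ("on", "Tuesday"), ("Tuesday", "morning")]:
    A[idx[a], idx[b]] += 1

# A maps a token to its likely successors; A^t maps it to predecessors.
assert A[idx["on"], idx["Monday"]] == 1
assert A.T[idx["Monday"], idx["on"]] == 1

# AA^t scores pairs that share successors, A^tA pairs that share
# predecessors; their sum is high for grammatically similar words.
S = A @ A.T + A.T @ A
assert S[idx["Monday"], idx["Tuesday"]] == 2  # shared successor and predecessor
```

Here Monday and Tuesday score 2 because they share both a predecessor ("on")
and a successor ("morning"), which is the sense in which the sum measures
grammatical similarity.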

3. Design a more memory-efficient indirect context model for the ICM and
ISSE components. The current design needs 32 bytes for a context seen once,
or 48-64 bytes if seen 2 or 3 times. These could be implemented in only 2 or
4 bytes by saving just the last 1 or 3 bytes seen, plus 1 byte for a counter
and hash checksum, and computing the state at prediction time. They also
need to be updated only once per byte instead of once per bit, so they
should be faster. This idea can obviously be extended to more frequently
seen contexts, although with smaller savings.
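A minimal sketch of one possible 2-byte entry layout (the field widths here
are my own guess, not the final design): one byte holds the last byte seen in
the context, and the second byte splits into a capped counter and a hash
checksum for collision detection.

```python
def pack_entry(last_byte, count, ctx_hash):
    # byte 0: last byte seen; byte 1: 4-bit capped count + 4-bit checksum
    check = ctx_hash & 0x0F
    return bytes([last_byte, (min(count, 15) << 4) | check])

def unpack_entry(entry, ctx_hash):
    # returns (last_byte, count), or None if the checksum detects a collision
    last_byte, meta = entry[0], entry[1]
    if meta & 0x0F != ctx_hash & 0x0F:
        return None
    return last_byte, meta >> 4

e = pack_entry(0x61, 3, 0xABCD)
assert unpack_entry(e, 0xABCD) == (0x61, 3)
assert unpack_entry(e, 0xABCE) is None      # checksum mismatch
```

The 8-bit bit-history state would then be recomputed from (last_byte, count)
at prediction time rather than stored.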

4. Redesign the match model to share the input string buffer (ZPAQ makes an
unnecessary copy) and to track multiple matches to make multiple
predictions.

5. The mixer would use a 16-bit partial token context instead of the current
8 bits.

6. Implement a short-term memory model consisting of a token queue with
strength decay rates that increase with frequency and are boosted for
titles and subheadings.

Semantics is a fuzzy identity relation with reflexive, symmetric, and
transitive properties:

Reflexive: "water" predicts "water", but the relation is antireflexive at
close range, so "water water" is rare. In my experiments, the probability of
repeating a word peaks after 50-100 bytes.

Symmetric: "water ... wet" predicts "wet ... water".

Transitive: "water ... wet" and "wet ... rain" predicts "water ... rain".

A semantic matrix B is therefore symmetric about the diagonal (B = B^t), so
I only need to store half of it. BB implements the transitive property.
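A toy sketch of the symmetric matrix and its square, using the water/wet/rain
example above (the association weights are invented):

```python
import numpy as np

words = ["water", "wet", "rain"]
i = {w: k for k, w in enumerate(words)}

# toy association strengths; symmetric by construction (B = B^t)
B = np.zeros((3, 3))
for a, b in [("water", "wet"), ("wet", "rain")]:
    B[i[a], i[b]] = B[i[b], i[a]] = 1.0
assert (B == B.T).all()

# BB links words two hops apart: water ~ wet and wet ~ rain imply water ~ rain
B2 = B @ B
assert B2[i["water"], i["rain"]] == 1.0
```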

-- Matt Mahoney, [email protected]

On Mon, Jan 19, 2026, 2:58 PM Quan Tesla <[email protected]> wrote:

> Not for any prize, but noteworthy. The protoscientific BNUT's (a nonlocal
> unified theory) axiomatic foundations, as equations and derivations,
> compress to 316 bits. Actually, 312 bits, arbitrarily padded to 316 bits.
> Why? Mathematically, it generally fits better overall.
>
> When compared to other unified theorems, such as 'String Theory', this
> screams "Impossible!". Yet provably, it does.
>
> There's a point in there to be noted. Turing isn't the benchmark anymore.
> As quantum theory matured, Turing has been surpassed. Why try and force 1
> more horsepower from an outdated engine? Redesign the engine to fit in with
> where the engine world's heading to.
>
> Science does indeed progress at its own pace. Turing was a genius pioneer,
> not the ultimate standard. I doubt he would've thought otherwise.
>
> Question is, are we ready for the quantum revolution about to hit us?
> Well, it is.
>
>
>
> On Fri, 16 Jan 2026, 05:24 Matt Mahoney, <[email protected]> wrote:
>
>> In other news, my Hutter prize preprocessor plus a custom ZPAQ model
>> compresses enwik9 from 1000 MB to 145 MB in 13 minutes using 4.5 GB of
>> memory, which places it near the Pareto frontier on the large text
>> benchmark.
>> https://encode.su/threads/4467-enwik9-preprocessor#post86938
>>
>> There is only a minor change to the preprocessor in step C. The steps are:
>> A - article sorting by topic.
>> B - basic XML decoding to extract text and headers into separate streams.
>> C - capitalization and space modeling and escape coding of rare
>> characters. The idea is to split the stream into tokens with independent
>> semantics. Capital letters are coded as a special character followed by
>> lower case. Then the first letter after a space is coded as upper case and
>> the space is removed.
>> D - dictionary encoding. Each of 256 byte values decodes to a common
>> group of letters found by byte pair encoding, restricted to parts of a
>> word, single digit, common punctuation, space (not all are removed), or
>> newline. This finds common suffixes like -s, -ed, -ing, etc., which are
>> tokens in themselves.
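The capitalization/space transform in step C can be sketched roughly as
follows (a minimal sketch: ESC is a placeholder escape byte, and the handling
of edge cases such as a space followed by a non-letter is one possible
choice, not necessarily the one in the actual preprocessor):

```python
ESC = "\x01"  # placeholder escape byte for capital letters

def encode(text):
    # pass 1: recode each capital as ESC + lower case
    s = "".join(ESC + c.lower() if c.isupper() else c for c in text)
    # pass 2: drop each space, marking it by upper-casing the next letter
    out, pending = [], False
    for c in s:
        if c == " ":
            pending = True
        elif pending and c.islower():
            out.append(c.upper())
            pending = False
        else:
            if pending:   # space not followed by a plain letter: keep it
                out.append(" ")
                pending = False
            out.append(c)
    if pending:
        out.append(" ")
    return "".join(out)

def decode(coded):
    # exact inverse of encode
    out, esc = [], False
    for c in coded:
        if esc:
            out.append(c.upper())
            esc = False
        elif c == ESC:
            esc = True
        elif c.isupper():
            out.append(" " + c.lower())
        else:
            out.append(c)
    return "".join(out)

assert decode(encode("The cat Sat on the Mat")) == "The cat Sat on the Mat"
```

The point of the transform is that an upper-case letter now carries exactly
one meaning (a removed space), so tokens get independent semantics.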
>>
>> These steps reduce enwik9 from 1000 MB to 580 MB in about 2 minutes,
>> which speeds up the downstream context model and reduces memory usage. The
>> output is then compressed with zpaqd, a ZPAQ development tool that I wrote
>> in 2014 that takes a config file that describes the context mixing
>> architecture and code in ZPAQL, a virtual machine byte code, to generate
>> the contexts. I wrote an order 0-1-2-3-4-6 byte ICM-ISSE chain, order 0-1
>> word chain, match model, and a final order 0 mixer, whose output is
>> arithmetic coded.
>>
>> An ICM is an indirect context model. It maps a context to an 8 bit state
>> representing a count of 0s and 1s and the most recent bit seen in that
>> context. That state is mapped to a table of predictions that is updated to
>> reduce the prediction error by 0.1%. An order n context means the last n
>> whole bytes plus the bits coded so far in the current byte.
>>
>> An ISSE is an indirect secondary symbol estimator. It mixes the stretched
>> previous prediction from the next lower order context with the constant 1
>> by weighted averaging and squashes the output to a prediction in the range
>> 0 to 1. The weight is selected by a hash of the current context and is
>> adjusted to reduce the prediction error by 0.1%. A prediction is stretched
>> by x = ln(p/(1-p)) and squashed by the inverse, p = 1/(1 + e^-x). This
>> makes a mixer a neural network with no hidden layer. In a word chain, the
>> context is a hash of the previous word (for order 1) and the partial word
>> bits coded so far, skipping any non letters.
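In code, the stretch/squash functions and a single ISSE stage might look like
this (a simplified sketch; ZPAQ selects the weight pair by a context hash,
represented here by a plain dict key):

```python
import math

def stretch(p):
    return math.log(p / (1 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

class ISSE:
    def __init__(self, rate=0.001):
        self.w = {}       # (hash of) context -> (w1, w0)
        self.rate = rate

    def predict(self, ctx, p_in):
        # mix the stretched input prediction with the constant 1
        w1, w0 = self.w.get(ctx, (1.0, 0.0))
        return squash(w1 * stretch(p_in) + w0 * 1.0)

    def update(self, ctx, p_in, p_out, bit):
        # adjust the weights to reduce the prediction error
        w1, w0 = self.w.get(ctx, (1.0, 0.0))
        err = self.rate * (bit - p_out)
        self.w[ctx] = (w1 + err * stretch(p_in), w0 + err)

# an untrained ISSE passes the input prediction through unchanged
assert abs(ISSE().predict(0, 0.8) - 0.8) < 1e-12
```

With the initial weights (1, 0), each stage starts as the identity on its
input prediction and only learns a per-context correction over time.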
>>
>> A match model searches for earlier long context matches using a hash
>> table and predicts whatever bit came next, weighted by the length of the
>> match.
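A byte-level sketch of the match model idea (simplified to predict whole
bytes rather than bits; the context length N=4 and the table layout are
illustrative, not ZPAQ's actual parameters):

```python
def match_predict(buf, pos, table, N=4):
    """Predict buf[pos] from an earlier occurrence of the last N bytes.
    Returns (predicted_byte, match_length) or None."""
    if pos < N:
        return None
    key = buf[pos - N:pos]
    prev = table.get(key)         # position just after an earlier occurrence
    table[key] = pos
    if prev is None:
        return None
    # extend backwards to measure how far the two contexts agree;
    # a longer match means a more confident prediction
    length = 0
    while length < prev and buf[prev - 1 - length] == buf[pos - 1 - length]:
        length += 1
    return buf[prev], length

buf = b"to be or not to be"
table, pred = {}, None
for pos in range(len(buf)):
    r = match_predict(buf, pos, table)
    if r is not None:
        pred = r

# the second "to b" matches the first and predicts the 'e' that followed it
assert pred == (ord("e"), 4)
```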
>>
>> A mixer is a 2 layer neural network (no hidden layer) that weights the
>> stretched predictions from all the other components and outputs the
>> squashed weighted sum as the final bit prediction. The weights are updated
>> to reduce the prediction error. In an order 0 mixer, the weight vector is
>> selected by the order 0 context including the partial current byte.
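A sketch of such a mixer (simplified: the learning rate and initial weights
are illustrative, ZPAQ operates bit by bit, and the integer context key here
stands in for the order 0 context):

```python
import math

def stretch(p):
    return math.log(p / (1 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

class Mixer:
    def __init__(self, n_inputs, rate=0.002):
        self.w = {}                 # context -> weight vector
        self.n = n_inputs
        self.rate = rate

    def mix(self, ctx, probs):
        # squashed weighted sum of the stretched component predictions
        w = self.w.setdefault(ctx, [0.3] * self.n)
        return squash(sum(wi * stretch(p) for wi, p in zip(w, probs)))

    def update(self, ctx, probs, p_out, bit):
        # gradient step that reduces the coding cost of the observed bit
        w = self.w[ctx]
        err = self.rate * (bit - p_out)
        for i, p in enumerate(probs):
            w[i] += err * stretch(p)

mx = Mixer(2)
probs = [0.9, 0.6]            # stand-ins for two component predictions
p0 = mx.mix(0, probs)
for _ in range(200):
    mx.update(0, probs, mx.mix(0, probs), 1)   # observed bit is always 1
assert mx.mix(0, probs) > p0   # the mixer learned to trust the 1s
```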
>>
>> The current leader on the large text benchmark is nncp, at 110 MB in 60
>> hours on an RTX-3090 GPU using a transformer network. The Hutter prize
>> winner is fx2-cmix at 113 MB including the decompressor executable, limited
>> to 70 hours and 10 GB in a single thread with no GPU. It is a context
>> mixing algorithm like mine (using some of my PAQ code) but mixing many
>> hundreds of models instead of just 10.
>> https://mattmahoney.net/dc/text.html
>>
>> -- Matt Mahoney, [email protected]
>>
>

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T0518db1e3a0c25c5-M607349e4b70d21a5f20f6e64
Delivery options: https://agi.topicbox.com/groups/agi/subscription
