On Mon, Jul 24, 2023, 8:11 AM stefan.reich.maker.of.eye via AGI <
[email protected]> wrote:

> If compression is superior to language models, when are we getting a
> compression based chat bot? :)
>

Compression is a way to test language models. A model predicts the next
token. A compressor predicts the next token and assigns it a code of length
log2(1/p) bits, where p is the predicted probability.
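
As a rough illustration (not taken from any actual compressor), the ideal
code length for a token with predicted probability p is log2(1/p) bits, so a
model's compressed size is just the sum of -log2 p over the file. The
predictor interface and the probability floor below are made up for the
example:

    import math

    def compressed_size_bits(tokens, predict):
        # predict(history) returns a dict mapping possible next token -> probability
        bits = 0.0
        for i, tok in enumerate(tokens):
            p = predict(tokens[:i]).get(tok, 1e-9)   # tiny floor for unseen tokens
            bits += -math.log2(p)
        return bits

    # A uniform predictor codes every byte at 8 bits, i.e. no compression;
    # a better model assigns higher p to the actual next byte and uses fewer bits.
    uniform = lambda history: {b: 1.0 / 256 for b in range(256)}
    print(compressed_size_bits(list(b"hello"), uniform))   # 40.0 bits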

The Hutter Prize and the Large Text Compression Benchmark (LTCB) use a 1 GB
text file because that is roughly how much language a human reads and hears
in a lifetime. This should be sufficient to learn the lexical, semantic, and
grammatical structure of a language, which the top compressors do; if they
didn't, they couldn't compress as well.

The test file is the first 1 GB of the English version of Wikipedia in
2006, which was 4 GB at the time. Now it is 85-90 GB without images, of
which about 10 GB is English text. (The download is 22 GB in bzip2 format;
bzip2 has a compression ratio of 0.253 on LTCB.) This is why LLMs know more
than any single human.

The previous winning entry, STARLIT, was an improvement over cmix that
sorted the articles by content so that similar articles end up next to each
other. This helps under tight memory constraints (10 GB), when it is not
possible to store all the statistics. Another entry, cmix-hp, improved
compression by adding a larger PPM model, but did not meet the time
constraint (50 hours on an average laptop). The current entry uses faster
data structures and cache alignment to (barely) meet the time requirement.

Cmix is a context mixing compressor like PAQ. It uses lots of different
prediction methods to guess the next bit and combines them using neural
networks that give greater weight to the better models. PAQ uses many
indirect context models, which map a hash of the current context (usually
starting on a byte or word boundary) to a bit history, the sequence of 0s
and 1s seen so far in that context. The bit history is in turn mapped by a
table to a probability, which is updated after the outcome of each bit is
known.
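
A heavily simplified sketch of those two ideas in Python (it skips the
bit-history indirection and maps the hashed context straight to an adaptive
probability; the table size, learning rates, and initial weights are
arbitrary, and the real PAQ/cmix code is far more elaborate):

    import math

    def stretch(p): return math.log(p / (1 - p))
    def squash(x): return 1 / (1 + math.exp(-x))

    class ContextModel:
        # Maps a hashed context to a probability that the next bit is 1,
        # nudged toward each observed bit (the bit-history step is omitted).
        def __init__(self, bits=16, rate=0.02):
            self.table = [0.5] * (1 << bits)
            self.mask = (1 << bits) - 1
            self.slot = 0
            self.rate = rate
        def predict(self, context):
            self.slot = hash(context) & self.mask
            return self.table[self.slot]
        def update(self, bit):
            p = self.table[self.slot]
            self.table[self.slot] = p + self.rate * (bit - p)

    class Mixer:
        # Combines several predictions in the logistic domain, with weights
        # adjusted online so that better models count for more.
        def __init__(self, n, lr=0.01):
            self.w = [0.3] * n
            self.lr = lr
        def mix(self, probs):
            self.x = [stretch(min(max(p, 1e-6), 1 - 1e-6)) for p in probs]
            self.p = squash(sum(w * x for w, x in zip(self.w, self.x)))
            return self.p
        def update(self, bit):
            err = bit - self.p
            self.w = [w + self.lr * err * x for w, x in zip(self.w, self.x)]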

These predictions are mixed with those of other models, like PPM. PPM
predicts at the byte level, mapping the context to a list of bytes seen in
that context and dropping to a lower order for bytes never seen in the
higher orders. PPM is more memory efficient but less flexible in the
contexts it can model.
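
A toy illustration of the PPM idea (it simply takes the longest context
seen so far and ignores proper escape probability accounting, so it is not
a faithful PPM implementation):

    from collections import defaultdict, Counter

    class ToyPPM:
        def __init__(self, max_order=4):
            self.max_order = max_order
            self.counts = defaultdict(Counter)   # context bytes -> counts of next byte
        def predict(self, history):
            # Use the longest context that has been seen before; "escape" to
            # shorter contexts when the longer one has no statistics yet.
            for order in range(self.max_order, -1, -1):
                ctx = bytes(history[-order:]) if order else b""
                if self.counts[ctx]:
                    total = sum(self.counts[ctx].values())
                    return {b: c / total for b, c in self.counts[ctx].items()}
            return {b: 1 / 256 for b in range(256)}   # nothing seen yet: uniform
        def update(self, history, nxt):
            for order in range(self.max_order + 1):
                ctx = bytes(history[-order:]) if order else b""
                self.counts[ctx][nxt] += 1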

The file is preprocessed by converting it to lower case and replacing words
with 1-3 byte tokens from an 88K-word dictionary. The dictionary models
semantics and syntax by grouping related words like "Monday" and "Tuesday";
the grouping can be automated by clustering words in context space and then
hand tuning the result. This makes it possible to predict related words by
dropping some low-order bits of the context. All of the older Hutter Prize
entries used dictionary preprocessing as well.
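
A hypothetical sketch of the idea (the word list, token layout, and escape
byte here are invented for the example, not the real 88K-word dictionary):

    dictionary = ["monday", "tuesday", "wednesday", "thursday",
                  "january", "february", "march", "april"]
    code_of = {w: i for i, w in enumerate(dictionary)}

    def encode(text, escape=0xFF):
        out = bytearray()
        for word in text.lower().split():
            if word in code_of:
                out.append(code_of[word])      # one-byte token for dictionary words
            else:
                out.append(escape)             # escape byte, then the literal word
                out += word.encode() + b" "
        return bytes(out)

    # "monday" -> 0 and "tuesday" -> 1 differ only in the low bit, so a context
    # model that ignores that bit treats the two days as the same symbol.
    print(list(encode("Monday Tuesday lunch")))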

Finally, the dictionary and the test file are compressed and appended to
the decompressor to make a self-extracting archive. Different parts of the
code are compiled to optimize for either size or speed. Many different
people contributed small improvements to the code, starting with the early
PAQ versions about 20 years ago.
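
In outline, the self-extracting trick can look something like this sketch,
with zlib standing in for the real context mixing coder and the stub path
and marker invented for the example:

    import sys, zlib   # zlib is only a stand-in for the actual coder

    MARKER = b"--PAYLOAD--"

    def build(stub_path, data, out_path):
        # Append the compressed payload to the decompressor program.
        with open(stub_path, "rb") as f:
            stub = f.read()
        with open(out_path, "wb") as f:
            f.write(stub + MARKER + zlib.compress(data))

    def extract(self_path=sys.argv[0]):
        # At run time the program reads its own file past the marker.
        with open(self_path, "rb") as f:
            blob = f.read()
        return zlib.decompress(blob[blob.index(MARKER) + len(MARKER):])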


------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T919d1153741947b5-Mf04e2091327f846fdf26f930