On Mon, Jul 24, 2023, 8:11 AM stefan.reich.maker.of.eye via AGI <[email protected]> wrote:
> If compression is superior to language models, when are we getting a
> compression based chat bot? :)

Compression is a way to test language models. A model predicts the next token. A compressor predicts the next token and assigns it a code of length log2(1/p) bits, where p is the predicted probability (a toy illustration is at the end of this post).

The Hutter prize and the Large Text Compression Benchmark (LTCB) use a 1 GB text file because that is about how much language a human reads and hears in a lifetime. This should be sufficient to learn the lexical, semantic, and grammatical structure of a language, which the top compressors do; if they didn't, they couldn't compress as well. The test file is the first 1 GB of the English version of Wikipedia in 2006, which was 4 GB at the time. Now it is 85-90 GB without images, of which 10 GB is in English. (The download is 22 GB in bzip2 format, which has a compression ratio of 0.253 on LTCB.) This is why LLMs know more than any single human.

The previous entry, STARLIT, was an improvement over cmix that sorted the articles by content. This helps under the tight memory constraint (10 GB), where it is not possible to store all the statistics. Another entry, cmix-hp, improved compression by adding a larger PPM model but did not meet the time constraint (50 hours on an average laptop). The current entry uses faster data structures and cache alignment to (barely) meet the time requirement.

Cmix is a context mixing compressor like PAQ. It uses many different prediction methods to guess the next bit and combines them using neural networks that give greater weight to the better models (sketched below).

PAQ uses a lot of indirect context models, which map a hash of the current context, usually starting on a byte or word boundary, to a bit history: the sequence of 0s and 1s seen in that context. The bit history is in turn mapped by a table to a probability, which is updated after the outcome is known (see the sketch below).

These predictions are mixed with other models, like PPM. PPM predicts at the byte level, mapping the context to a list of bytes seen in that context and dropping to a lower order for bytes never seen at the higher orders (a toy version is sketched below). PPM is more memory efficient but less flexible in the contexts it can model.

The file is preprocessed by converting it to lower case and converting words to 1-3 byte tokens using an 88K word dictionary. The dictionary models semantics and syntax by grouping related words like "Monday" and "Tuesday"; the grouping can be automated by clustering in context space and then hand tuned. This makes it possible to predict related words by dropping some low order bits of the context (illustrated below). All of the older Hutter entries used dictionary preprocessing.

Finally, the dictionary and the test file are compressed and appended to the decompressor to make a self extracting archive. Different parts of the code are compiled to optimize for size or speed. Many different people have contributed to the code, making small improvements starting with the early PAQ versions about 20 years ago.
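To make the scoring concrete, here is a minimal sketch (my own toy example, not code from any Hutter prize entry): an ideal arithmetic coder spends log2(1/p) bits on each symbol it predicts with probability p, so the compressed size is just the model's cross entropy on the data. The order-1 byte model with add-one smoothing is purely illustrative.

    // Toy demonstration: the better the predictor, the fewer bits an ideal
    // arithmetic coder would need. Order-1 byte model with add-one smoothing.
    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <unordered_map>

    int main() {
        std::string data = "the cat sat on the mat. the cat sat.";
        std::unordered_map<int, std::unordered_map<int, int>> counts; // prev byte -> next byte -> count
        std::unordered_map<int, int> totals;                          // prev byte -> total count
        double bits = 0;
        int prev = 0;
        for (unsigned char c : data) {
            double p = (counts[prev][c] + 1.0) / (totals[prev] + 256.0); // P(next = c | prev)
            bits += -std::log2(p);   // ideal code length for this byte
            ++counts[prev][c];
            ++totals[prev];
            prev = c;
        }
        std::printf("%.1f bits total, %.2f bits/char\n", bits, bits / data.size());
    }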
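The neural network mixing is, at its core, online logistic regression: stretch each model's probability into the logit domain, take a weighted sum, squash it back to a probability, and then nudge the weights toward the models that predicted the observed bit well. A simplified sketch of the idea, not the actual cmix code:

    #include <cmath>
    #include <vector>

    double stretch(double p) { return std::log(p / (1 - p)); }  // probability -> logit
    double squash(double x)  { return 1 / (1 + std::exp(-x)); } // logit -> probability

    struct Mixer {
        std::vector<double> w;   // one weight per input model
        std::vector<double> x;   // stretched inputs from the last prediction
        double lr = 0.01;        // learning rate (illustrative value)

        explicit Mixer(size_t n) : w(n, 0.0), x(n, 0.0) {}

        // Combine the models' bit-1 probabilities into one prediction.
        double predict(const std::vector<double>& probs) {
            double dot = 0;
            for (size_t i = 0; i < w.size(); ++i) {
                x[i] = stretch(probs[i]);
                dot += w[i] * x[i];
            }
            return squash(dot);
        }

        // After the actual bit is known, shift weight toward the better models.
        void update(double p, int bit) {
            double err = bit - p;
            for (size_t i = 0; i < w.size(); ++i) w[i] += lr * err * x[i];
        }
    };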
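An indirect context model can be sketched the same way (illustrative only; the real PAQ tables are fixed-size hash tables with carefully designed bit-history states rather than unbounded maps):

    #include <cstdint>
    #include <unordered_map>

    struct IndirectModel {
        std::unordered_map<uint32_t, uint8_t> history; // context hash -> last 8 bits seen
        std::unordered_map<uint8_t, double>   prob;    // bit history -> P(next bit = 1)
        uint32_t ctx = 0;                              // hash of the current context

        void set_context(uint32_t hash) { ctx = hash; }

        double predict() {
            uint8_t h = history[ctx];
            auto it = prob.find(h);
            return it == prob.end() ? 0.5 : it->second;
        }

        void update(int bit) {
            uint8_t& h = history[ctx];
            double& p = prob.try_emplace(h, 0.5).first->second;
            p += 0.05 * (bit - p);   // adapt the probability for this history
            h = (h << 1) | bit;      // append the new bit to the history
        }
    };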
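PPM's fallback can be illustrated with a toy model that simply drops to a shorter context whenever the byte was never seen at the current order (real PPM assigns an explicit escape probability and excludes bytes already ruled out):

    #include <array>
    #include <map>
    #include <string>

    struct ToyPPM {
        static const int MAX_ORDER = 3;
        // For each order, map a context string to counts of the bytes seen after it.
        std::array<std::map<std::string, std::map<char, int>>, MAX_ORDER + 1> counts;

        // Estimate P(next = c | history), escaping to lower orders as needed.
        double predict(const std::string& history, char c) const {
            for (int order = MAX_ORDER; order >= 0; --order) {
                if ((int)history.size() < order) continue;
                std::string ctx = history.substr(history.size() - order);
                auto it = counts[order].find(ctx);
                if (it == counts[order].end()) continue;  // context never seen
                auto jt = it->second.find(c);
                if (jt == it->second.end()) continue;     // byte never seen here: escape
                int total = 0;
                for (const auto& kv : it->second) total += kv.second;
                return double(jt->second) / (total + 1);  // +1 leaves mass for escapes
            }
            return 1.0 / 256;                             // order -1: uniform over bytes
        }

        void update(const std::string& history, char c) {
            for (int order = 0; order <= MAX_ORDER && order <= (int)history.size(); ++order)
                ++counts[order][history.substr(history.size() - order)][c];
        }
    };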
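And a tiny illustration of the dictionary preprocessing, using a hypothetical five-word dictionary in place of the real 88K-word one: related words get nearby codes, so masking the low-order bits of a token puts "monday" and "tuesday" in the same group, which is what lets the models generalize across them.

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    int main() {
        // Hypothetical token assignments; weekdays share one block of codes.
        std::unordered_map<std::string, uint16_t> dict = {
            {"monday", 0x0100}, {"tuesday", 0x0101}, {"wednesday", 0x0102},
            {"cat", 0x0200}, {"dog", 0x0201},
        };
        std::vector<std::string> words = {"monday", "cat", "tuesday"};
        for (const auto& w : words) {
            uint16_t tok = dict.at(w);
            // Dropping the low 4 bits groups "monday" and "tuesday" together.
            std::printf("%-10s token=0x%04x group=0x%04x\n",
                        w.c_str(), unsigned(tok), unsigned(tok & 0xFFF0u));
        }
    }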
