I released another update to my Hutter prize entry. https://encode.su/threads/4467-enwik9-preprocessor#post87076
I kept the same 4 preprocessing steps as before and added a general purpose context mixing compressor: an order 0-1-2-3-4-6 ICM-ISSE chain, a match model, and a neural network mixer. It uses components from libzpaq, but discards the ZPAQL interpreter/compiler, instead computing contexts in C++ with a hard-coded model. It can still be used as a preprocessor for other compressors as before, or as a general purpose compressor by skipping the preprocessing steps, although the last two steps should be useful for other text files. It compresses enwik9 to 145 MB in 14 minutes using 2.8 GB of memory, which is near the Pareto frontier on the Large Text Benchmark. I modified the match model to use the uncompressed string as a buffer instead of making a copy like ZPAQ does, saving 0.5 GB. The program will serve as a framework for some experiments I have planned to improve speed and memory use in the context models.

I believe it is possible to code an adult-level language model on a PC within the Hutter prize limits. The human brain has a long-term memory capacity of about 10^9 bits, the same as the compressed size of enwik9 and 1% of the 10 GB memory limit. The model has to run 10,000 times faster than a human, compressing 20 years of learning into about a day. This is doable on a neural network with 10^9 parameters because the information rate is only 5 bits per token over 200M tokens, which means only a small fraction of the parameters need to be updated per token. Assuming a 50K-word vocabulary and an 8-token short-term memory, 400K parameters need to be read and updated per token, a total of 160 trillion operations, or about 2 billion operations per second. Spreading 5 bits over 400K parameters implies a learning rate on the order of the inverse square root of the parameter count, or about 0.1% per parameter, so 16-bit precision should be sufficient.
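As a sanity check, the arithmetic above can be worked through in a few lines of Python. All the inputs here are the assumptions stated in the post (token count, vocabulary size, context length, one day of training), not measured values:

```python
import math

tokens = 200_000_000     # enwik9 tokenized into words (assumed)
vocab = 50_000           # assumed word vocabulary
context = 8              # assumed short-term memory, in tokens
seconds = 86_400         # about one day of training

params_per_token = vocab * context        # parameters touched per token
ops = 2 * params_per_token * tokens       # one read + one update each
ops_per_second = ops / seconds
lr = 1 / math.sqrt(params_per_token)      # inverse-square-root learning rate
matrix_bytes = 10**9 * 2                  # 10^9 weights at 16 bits = 2 GB

print(params_per_token)      # 400000
print(ops)                   # 160000000000000, i.e. 160 trillion
print(f"{ops_per_second:.3g}")  # 1.85e+09, about 2 billion per second
print(f"{lr:.4f}")           # 0.0016, on the order of 0.1% per parameter
```

A per-parameter update of roughly 0.1% is also consistent with 16-bit precision, since a 16-bit weight has a step size of about 2^-16 ≈ 0.0015%, comfortably below the update size.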
This would require 2 GB of memory each for the syntactic and semantic matrices, plus some intermediate neurons representing parts of speech to model relations like dog = animal = noun, so that a context like "the dog" can be used to predict a sequence like "the cat". In my earlier experiments with ZPAQ, the minimum precision seems to be about 12 bits for activation levels or probabilities and 20 bits for weights before compression starts to deteriorate.

-- Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tefdd3e588dd95259-M7e2ee7b986b0594be98011e2
