On Tue, Mar 31, 2020 at 7:57 PM <[email protected]> wrote:
>
> New: Same 10,000,000 bytes losslessly compressed to 2,305,386 bytes. Same
> code.
Some results with zpaq -m1 ... -m5 on enwik7 (the first 10 MB of enwik8 or
enwik9). Compressed sizes are in bytes; compression and decompression times
are in seconds on a 2.53 GHz Core i5 M540, 4 GB, Windows 7.

  method  size     comp   decomp
  -m1     3717520   1.03   0.25
  -m2     3276869   4.63   0.29
  -m3     2375046   4.72   4.71
  -m4     2188214  14.21  14.91
  -m5     2091360  45.47  46.52

Methods m1 and m2 use LZ77. m2 spends more time finding better matches (a
suffix array to find the longest match instead of a hash table) and adds 1
byte of lookahead. m3 uses a BWT followed by an order 0-1 ICM-ISSE chain for
modeling. m4 and m5 use context modeling. Both have ICM-ISSE chains and
whole-word contexts, but m5 has additional models.

BWT is a Burrows-Wheeler transform, in which the bytes are sorted by context
to bring similar contexts together.

An ICM-ISSE chain starts with an order 0 ICM (indirect context model),
followed by ISSE predictors with increasingly long contexts. An ICM maps a
context to a bit history, which is mapped to a prediction. An ISSE is a
2-input mixer (neural network) with one input taken from the previous
component's stretched prediction and the other fixed at 1. The bit history
is used to select a pair of weights.

The idea of a bit history is to save the last several bits seen in a
context. If you see a sequence like 0000000001, what is the probability that
the next bit is 1? If the data is stationary, then it is 1/10. But a highly
adaptive model might give a higher probability. An indirect model sidesteps
this problem by reusing statistics from other contexts that observed the
same sequence.

This paper describes the compression algorithm in more detail:
http://mattmahoney.net/dc/zpaq_compression.pdf

--
Matt Mahoney, [email protected]

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tcfc4df5e57c62b43-Mb6d2d6f575195ed8b63433f5
Delivery options: https://agi.topicbox.com/groups/agi/subscription
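[Editor's note: the "bytes sorted by context" idea behind the BWT can be seen in the classic sorted-rotations formulation. This is a minimal Python sketch of that textbook view, not zpaq's actual implementation, which uses a much faster suffix sort; the sentinel byte is an assumption of this sketch.]

```python
# Sketch of a Burrows-Wheeler transform via sorted rotations.
# (Illustrative only; real BWT implementations use suffix sorting.)
def bwt(data: bytes, sentinel: bytes = b"\x00") -> bytes:
    """Return the last column of the sorted rotations of data+sentinel."""
    s = data + sentinel  # sentinel makes every rotation distinct
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # Sorting rotations sorts each byte by the context that follows it,
    # so bytes that appeared in similar contexts end up adjacent.
    return bytes(r[-1] for r in rotations)

print(bwt(b"banana"))  # -> b'annb\x00aa'
```

Note how the three a's that all preceded "n..." contexts cluster together in the output, which is exactly what makes the transformed data easy for a low-order model to predict.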

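[Editor's note: the indirect-modeling idea above can be sketched in a few lines. Everything here is illustrative: the class name, the learning rate, and the simple capped (zeros, ones) count pair standing in for a bit history are assumptions, not zpaq's actual bit-history state machine.]

```python
# Toy indirect context model (ICM): context -> bit history -> shared
# adaptive prediction. Hypothetical sketch, not zpaq's implementation.
class ToyICM:
    def __init__(self, rate=0.1, cap=15):
        self.rate = rate
        self.cap = cap
        self.history = {}   # context -> (count of 0s, count of 1s)
        self.table = {}     # bit history -> P(next bit = 1)

    def predict(self, ctx):
        h = self.history.get(ctx, (0, 0))
        return self.table.get(h, 0.5)

    def update(self, ctx, bit):
        zeros, ones = self.history.get(ctx, (0, 0))
        h = (zeros, ones)
        # The prediction is stored in a table shared by *all* contexts
        # that reached the same bit history -- the "indirect" part.
        p = self.table.get(h, 0.5)
        self.table[h] = p + self.rate * (bit - p)
        if bit:
            ones = min(ones + 1, self.cap)
        else:
            zeros = min(zeros + 1, self.cap)
        self.history[ctx] = (zeros, ones)

m = ToyICM()
# One context sees nine 0s, then a 1, then another 1 ...
for b in [0] * 9 + [1, 1]:
    m.update("ctx_a", b)
# ... so a *different* context that just saw nine 0s and a 1 inherits
# a raised probability for the next bit being 1.
for b in [0] * 9 + [1]:
    m.update("ctx_b", b)
print(round(m.predict("ctx_b"), 4))
```

The point of the demo is that ctx_b never saw what follows 0000000001, yet it predicts more than 0.5 for a 1, because ctx_a already recorded what happened after that same bit history.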