On Sat, Mar 20, 2021 at 7:49 AM Immortal Discoveries <[email protected]> wrote:
>
> https://encode.su/threads/3595-Star-Engine-AI-data-compressor
>
> Star Engine - AI data compressor
>
> I named my AI after unstable stars and atoms, which gravitate in matter to
> "compress it" and then once too large will extract it out as radiation to
> "generate new insights". It's currently in Python (~10x slower than Green,
> hence ~12 hours for 100MB training), uses lots of RAM, and only outputs
> binary '01010101' instead of fully compressed 'Y', but I just started
> implementation and know how to fix all that.
>
> EVALUATION RESULTS (compare to Hutter Prize and Large Text Compression
> Benchmark champions):
>
> 10,000 bytes in
> 3,328 bytes out
> Shelwien's Green: 3,453
>
> 50,000 bytes in
> 15,174 bytes out
> Shelwien's Green: ???
>
> 100,000 bytes in
> 28,028 bytes out
> Shelwien's Green: 29,390
>
> 1,000,000 bytes in
> 244,494 bytes out
> Shelwien's Green: 256,602
>
> 10,000,000 bytes in
> [old] 2,288,646 bytes out
> Shelwien's Green: 2,349,214
>
> 100MB bytes in
> I estimate I "can" get ~20,400,000 bytes out
> Shelwien's Green: 21,819,822
Just some quick tests with zpaq on my 7-year-old laptop (Intel i5 M540, 2.67 GHz, 4 GB, Windows 7 64-bit) on enwik8 (100 MB):

zpaq -m1: 35,691,736 bytes in 5.6 seconds.
-m2: 30,803,140 in 38 s.
-m3: 21,98,368 in 35 s.
-m4: 20,740,507 in 112 s.
-m5: 19,625,017 in 336 s.
-m57: 19,084,598 in 497 s.

Times to compress are wall times with 2 threads compressing blocks of 65 and 35 MB independently in parallel. -m5 total CPU time is 507 seconds. -m57 uses a single block and thread. I think you said your Python program takes 12 hours. Zpaq is written in C++, but the compression engine is coded in ZPAQL, a sandboxed assembler-like language that is translated into x86-64 at run time.

-m1 and -m2 use LZ77 compression. -m1 uses a hash table to find matches, which are encoded using variable-length codes. -m2 uses a suffix array to find the longest matches, so both decompress very fast.

-m3 uses BWT (the Burrows-Wheeler context sorting transform) followed by an order 0-1 ICM-ISSE chain and bitwise arithmetic coding. Sorting by context brings together long runs of related bytes to enable low order (low memory) modeling. An ICM (indirect context model) maps a context (the previously coded bits of the current byte, in the case of order 0) to a bit history, an 8 bit state representing the last several bits seen in this context. The bit history is then mapped to a probability table entry, which is adjusted up or down by a small amount (like .001) when the actual bit is revealed. This prediction could be coded directly, but is instead mixed with an order 1 ISSE (indirect secondary symbol estimator). This is a 2-input neuron that mixes the previous prediction with a constant 1, where the two weights are selected by the order 1 context (previous byte and previous bits of the current byte). Predictions are stretched to the logistic domain (ln(p/(1-p))) before weighted averaging, then squashed by the inverse function (1/(1+exp(-x))) on output to the arithmetic coder.

-m4 and -m5 are context mixing models.
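To make the ICM idea concrete, here is a toy Python sketch, not zpaq's actual code: zpaq's bit history is an 8 bit state machine, which I simplify here to the last few bits seen in the context, and the table sizes and adaptation rate are invented for illustration:

```python
class ICM:
    """Toy indirect context model: context -> bit history -> probability.
    The real zpaq bit history is an 8-bit state; here it is just a
    capped tuple of the most recent bits seen in the context."""
    def __init__(self, rate=0.02):
        self.history = {}   # context -> tuple of recent bits
        self.prob = {}      # bit history -> P(next bit = 1)
        self.rate = rate    # adaptation rate (illustrative value)

    def predict(self, cx):
        self.cx = cx
        self.h = self.history.get(cx, ())
        return self.prob.get(self.h, 0.5)   # unseen history: no opinion

    def update(self, bit):
        # nudge the probability stored for this bit history toward the outcome
        p = self.prob.get(self.h, 0.5)
        self.prob[self.h] = p + self.rate * (bit - p)
        # append the bit to the context's history, keeping the last 4 bits
        self.history[self.cx] = (self.h + (bit,))[-4:]

icm = ICM()
for b in (1,) * 8:        # one context keeps seeing 1 bits
    icm.predict(0)
    icm.update(b)
```

After a run of 1s, the history for context 0 saturates at (1,1,1,1) and the probability stored for that history drifts above 0.5, which is the indirection the "indirect" in ICM refers to: the context selects a history, and the history selects the probability.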
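The stretch/squash mixing step can be sketched the same way; again a toy illustration under assumed parameters (the learning rate and context table size are made up, not zpaq's):

```python
import math

def stretch(p):
    # logistic transform: ln(p/(1-p))
    return math.log(p / (1.0 - p))

def squash(x):
    # inverse of stretch: 1/(1+e^-x)
    return 1.0 / (1.0 + math.exp(-x))

class ISSE:
    """2-input neuron: mixes a stretched input prediction with a
    constant 1, the weight pair selected by a context."""
    def __init__(self, n_contexts=256, rate=0.001):
        # initial weights pass the input prediction through unchanged
        self.w = [[1.0, 0.0] for _ in range(n_contexts)]
        self.rate = rate

    def predict(self, cx, p_in):
        self.cx, self.x = cx, [stretch(p_in), 1.0]
        w = self.w[cx]
        return squash(w[0] * self.x[0] + w[1] * self.x[1])

    def update(self, bit, p):
        # gradient step toward the observed bit (0 or 1)
        err = (bit - p) * self.rate
        w = self.w[self.cx]
        w[0] += err * self.x[0]
        w[1] += err * self.x[1]

m = ISSE()
p = m.predict(cx=65, p_in=0.9)   # e.g. order-1 context = previous byte 'A'
m.update(1, p)                   # actual bit was 1; weights shift up
```

Averaging in the stretched (logistic) domain rather than in probability gives confident predictions near 0 or 1 much more influence than timid ones near 0.5, which is why context mixing works better than linear averaging of probabilities.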
-m4 is a simple model with an order 0-1-2-3-4-6 ICM-ISSE chain, a match model, and a final mixer. A match model looks for long context matches and predicts whatever bit came next, weighted by the length of the match. The final mixer is a neural network that takes stretched predictions from all the other components and uses an order 0 context to select the weights.

-m5 is a bigger model that adds some word and sparse models and a final SSE. The word model is an order 0-1 ICM-ISSE chain where the contexts are hashes of whole words instead of bytes, mapped to upper case and ignoring any spaces or punctuation in between. The sparse models skip 1 to 3 bytes to model structured data. The final mixer prediction is adjusted by an SSE, a table that maps an order 0 context and the quantized, interpolated prediction to a new probability that is adaptively adjusted.

That's just an overview for text. Zpaq also deduplicates and selects different algorithms based on an analysis of the input. I describe the algorithm in more detail in http://mattmahoney.net/dc/zpaq_compression.pdf

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T7cd459770824f7b7-M7cce72748ec7f5ee1a90a7ad
Delivery options: https://agi.topicbox.com/groups/agi/subscription
