On Sat, Mar 20, 2021 at 7:49 AM Immortal Discoveries
<[email protected]> wrote:
>
> https://encode.su/threads/3595-Star-Engine-AI-data-compressor
>
> Star Engine - AI data compressor
>
> I named my AI after unstable stars and atoms, which gravitate in matter to 
> "compress it" and then once too large will extract it out as radiation to 
> "generate new insights". It's currently in python (~10x slower than Green, 
> hence ~12 hours for 100MB training), uses lots of RAM, and only outputs 
> binary '01010101' instead of fully compressed 'Y', but I just started 
> implementation and know how to fix all that.
>
>
> EVALUATION RESULTS (compare to Hutter Prize and Large Text Compression 
> Benchmark champions):
> 10,000 bytes in
> 3,328 bytes out
> Shelwien's Green: 3,453
>
> 50,000 bytes in
> 15,174 bytes out
> Shelwien's Green: ???
>
> 100,000 bytes in
> 28,028 bytes out
> Shelwien's Green: 29,390
>
> 1,000,000 bytes in
> 244,494 bytes out
> Shelwien's Green: 256,602
>
> 10,000,000 bytes in
> [old] 2,288,646 bytes out
> Shelwien's Green: 2,349,214
>
> 100,000,000 bytes in
> I estimate I "can" get ~20,400,000 bytes out
> Shelwien's Green: 21,819,822

Just some quick tests with zpaq on my 7-year-old laptop (Intel i5
M540, 2.67 GHz, 4 GB, Windows 7 64-bit) on enwik8 (100 MB):

zpaq -m1: 35,691,736 bytes in 5.6 s.
zpaq -m2: 30,803,140 bytes in 38 s.
zpaq -m3: 21,98,368 bytes in 35 s.
zpaq -m4: 20,740,507 bytes in 112 s.
zpaq -m5: 19,625,017 bytes in 336 s.
zpaq -m57: 19,084,598 bytes in 497 s.

Times to compress are wall times with 2 threads compressing blocks of
65 and 35 MB independently, in parallel. -m5 total CPU time is 507
seconds. -m57 uses a single block and thread. I think you said your
Python program takes 12 hours. Zpaq is written in C++ but the
compression engine is coded in ZPAQL, a sandboxed assembler like
language that is translated into x86-64 at run time.

-m1 and -m2 use LZ77 compression. -m1 uses a hash table to find
matches. Matches are encoded using variable length codes. -m2 uses a
suffix array to find the longest matches, so both decompress very
fast.
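To make the -m1 approach concrete, here is a toy sketch of hash-table match finding (my illustration in Python, not zpaq's actual code; a real LZ77 coder would also pack the offsets and lengths into variable length codes):

```python
def lz77_hash_parse(data, min_match=4, hash_bits=16):
    """Toy LZ77 parser: a hash table maps the hash of the next
    min_match bytes to the most recent position where they occurred.
    Emits ('match', offset, length) tokens or ('lit', byte) tokens."""
    table = {}  # hash -> last position seen with that hash
    out = []
    i = 0
    while i < len(data):
        if i + min_match <= len(data):
            key = hash(data[i:i + min_match]) & ((1 << hash_bits) - 1)
            j = table.get(key)
            table[key] = i
            # verify the candidate (hashes can collide), then extend it
            if j is not None and data[j:j + min_match] == data[i:i + min_match]:
                length = min_match
                while i + length < len(data) and data[j + length] == data[i + length]:
                    length += 1
                out.append(('match', i - j, length))
                i += length
                continue
        out.append(('lit', data[i]))
        i += 1
    return out

def lz77_decode(tokens):
    """Rebuild the original bytes by copying from the output so far."""
    buf = bytearray()
    for t in tokens:
        if t[0] == 'lit':
            buf.append(t[1])
        else:
            _, offset, length = t
            for _ in range(length):
                buf.append(buf[-offset])
    return bytes(buf)
```

Because the table only remembers the most recent position per hash, it is fast but may miss longer, older matches; that is exactly what -m2's suffix array fixes at the cost of more time and memory.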

-m3 uses BWT (Burrows Wheeler context sorting transform) followed by
an order 0-1 ICM-ISSE chain and bitwise arithmetic coding. Sorting by
context brings together long runs of related bytes to enable low order
(low memory) modeling. An ICM (indirect context model) maps a context
(the previously coded bits of the current byte in the case of order 0)
to a bit history, an 8 bit state representing the last several bits
seen in this context. The bit history is then mapped to a probability
table which is adjusted up or down by a small amount (like .001) when
the actual bit is revealed. This prediction could be encoded directly,
but is instead mixed with an order 1 ISSE (indirect secondary symbol
estimator). This is a 2 input neuron that mixes the previous
prediction with a constant 1, where the two weights are selected by
the order 1 context (previous byte and previous bits of the current
byte). Predictions are stretched to the logistic domain
(ln(p/(1-p))) before weighted averaging and squashed by the inverse
function (1/(1+exp(-x))) on output to the arithmetic coder.
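In code, the stretch/squash transforms and a 2-input ISSE-style neuron might look like the following (a simplified floating-point sketch of the idea; zpaq itself uses quantized fixed-point lookup tables, and the learning rate here is my arbitrary choice):

```python
import math

def stretch(p):
    """Map a probability in (0,1) to the logistic domain."""
    return math.log(p / (1 - p))

def squash(x):
    """Inverse of stretch: logistic domain back to a probability."""
    return 1 / (1 + math.exp(-x))

class ISSE:
    """2-input neuron mixing a prior prediction with a constant 1.
    The pair of weights is selected by context and trained online
    by gradient descent on coding (log) loss."""
    def __init__(self, n_ctx, rate=0.02):
        # per-context weights for [stretched prior, constant 1]
        self.w = [[1.0, 0.0] for _ in range(n_ctx)]
        self.rate = rate

    def predict(self, ctx, p_prior):
        self.ctx = ctx
        self.inputs = [stretch(p_prior), 1.0]
        w = self.w[ctx]
        self.p = squash(w[0] * self.inputs[0] + w[1] * self.inputs[1])
        return self.p

    def update(self, bit):
        # (bit - p) is the log-loss gradient in the logistic domain
        err = bit - self.p
        w = self.w[self.ctx]
        for i in range(2):
            w[i] += self.rate * err * self.inputs[i]
```

The constant-1 input lets the neuron learn a per-context bias, so even when the prior prediction is uninformative (p = 0.5 stretches to 0) the context can still pull the output toward its own statistics.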

-m4 and -m5 are context mixing models. -m4 is a simple model with an
order 0-1-2-3-4-6 ICM-ISSE chain, a match model, and a final mixer. A
match model looks for long context matches and predicts whatever bit
came next, weighted by the length of the match. The final mixer is a
neural network taking stretched predictions from all the other
components and uses an order 0 context to select the weights. -m5 is a
bigger model that includes some word and sparse models and a final
SSE. The word model is an order 0-1 ICM-ISSE chain where the contexts
are hashes of whole words instead of bytes, mapped to upper case and
ignoring any spaces or punctuation in between. The sparse models skip
1 to 3 bytes to model structured data. The final mixer prediction is
adjusted by an SSE, a table that takes an order 0 context and the
quantized and interpolated prediction to a new probability that is
adaptively adjusted.
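A floating-point sketch of the final mixer and SSE stages (again my simplification: zpaq quantizes the stretched prediction and interpolates between fixed-point table entries; the bin count and rates here are illustrative):

```python
import math

def stretch(p): return math.log(p / (1 - p))
def squash(x): return 1 / (1 + math.exp(-x))

class Mixer:
    """Logistic mixing: a weighted sum of stretched input predictions,
    squashed to a probability; the weight vector is chosen by context."""
    def __init__(self, n_inputs, n_ctx, rate=0.01):
        self.w = [[0.3] * n_inputs for _ in range(n_ctx)]
        self.rate = rate

    def mix(self, ctx, probs):
        self.ctx, self.x = ctx, [stretch(p) for p in probs]
        self.p = squash(sum(w * x for w, x in zip(self.w[ctx], self.x)))
        return self.p

    def update(self, bit):
        err = (bit - self.p) * self.rate
        w = self.w[self.ctx]
        for i, x in enumerate(self.x):
            w[i] += err * x

class SSE:
    """Secondary estimation: index a table by (context, quantized
    stretched prediction), interpolate between the two nearest bins,
    and nudge those bins toward each actual bit."""
    def __init__(self, n_ctx, n_bins=33, rate=0.02):
        # initialize every row to the identity mapping
        xs = [-8.0 + 16.0 * i / (n_bins - 1) for i in range(n_bins)]
        self.t = [[squash(x) for x in xs] for _ in range(n_ctx)]
        self.rate = rate
        self.n_bins = n_bins

    def refine(self, ctx, p):
        x = max(-8.0, min(8.0, stretch(p)))
        f = (x + 8.0) / 16.0 * (self.n_bins - 1)
        i = min(int(f), self.n_bins - 2)
        self.ctx, self.i, self.frac = ctx, i, f - i
        row = self.t[ctx]
        self.p = row[i] * (1 - self.frac) + row[i + 1] * self.frac
        return self.p

    def update(self, bit):
        err = (bit - self.p) * self.rate
        row = self.t[self.ctx]
        row[self.i] += err * (1 - self.frac)
        row[self.i + 1] += err * self.frac
```

Because the SSE rows start as the identity, it costs almost nothing when the mixer is already well calibrated, and it learns a per-context correction curve where it is not.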

That's just an overview for text. Zpaq also deduplicates and selects
different algorithms based on an analysis of the input. I describe the
algorithm in more detail in
http://mattmahoney.net/dc/zpaq_compression.pdf

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T7cd459770824f7b7-M7cce72748ec7f5ee1a90a7ad
Delivery options: https://agi.topicbox.com/groups/agi/subscription
