That's progress. If you would like me to add it to the Large Text
Compression Benchmark, then test it on enwik9 and report compression and
decompression times, memory usage, and your test hardware. You should be
able to pass the input and output file names as command line arguments in
Python and save the output in a binary format. But really you should write
it in C or C++ for better speed.
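For example, something like this would take the file names as arguments
and pack a '0101...' text string into real bytes (a rough sketch only,
to be adapted to your code):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('infile')   # the '01010101...' text your coder emits
    parser.add_argument('outfile')  # packed binary output
    args = parser.parse_args()

    with open(args.infile) as f:
        bits = f.read().strip()

    # pad to a whole number of bytes, then pack 8 bits per output byte
    bits += '0' * (-len(bits) % 8)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    with open(args.outfile, 'wb') as f:
        f.write(data)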

I think the algorithm you are describing is PPM with arithmetic coding. The
probabilities are weighted to favor longer contexts and recent matches.

On Sat, Mar 20, 2021, 7:49 AM Immortal Discoveries <
[email protected]> wrote:

> https://encode.su/threads/3595-Star-Engine-AI-data-compressor
>
> Star Engine - AI data compressor
>
> I named my AI after unstable stars and atoms, which pull matter in to
> "compress it" and then, once too large, radiate it back out to "generate
> new insights". It's currently in Python (~10x slower than Green, hence ~12
> hours for 100 MB of training), uses lots of RAM, and only outputs the bits
> as text ('01010101') instead of packed bytes (e.g. 'Y'), but I just
> started the implementation and know how to fix all that.
>
>
> EVALUATION RESULTS (compare to Hutter Prize and Large Text Compression
> Benchmark champions):
>
> Input size           Star Engine               Shelwien's Green
> 10,000 bytes         3,328                     3,453
> 50,000 bytes         15,174                    ???
> 100,000 bytes        28,028                    29,390
> 1,000,000 bytes      244,494                   256,602
> 10,000,000 bytes     2,288,646 [old]           2,349,214
> 100,000,000 bytes    ~20,400,000 (my estimate) 21,819,822
>
>
> NEXT LETTER PREDICTION RESULTS (compare to the size of data that would be
> needed to cheatingly retrieve the subjectively correct next 500 letters
> for a given prompt):
> FOR 10,000 BYTES TRAINED ON:
> The girl was sitting on Wikiquot;[http://www.lewrockwell|Ramp>
> <contributor>
> <text xml:space="preserve">#REDIRECT [[AlECT [[AREDIRECT [[Acce
> key="8">MediaWiki talk</namespace>
> <namespace>
> <namespace key="-1"-1">Template talk</namespace>
> <namespace key="15">C51ist society might wom and
> prediawiki.org/xml/export-0.3/" '' moChmlers<potkin|Kropotkin]],
> PeternSChmler, cht w[s0��xpace>
> <namespace key="12"1:/timestamp>2002-02002-02-25T15T15">Wikts. [[Bertrand
> chietikte Wtrand conal[http://uk.end
> </page>
> <page>
> </revision>
> </page>
> <namespace key="geri.3<c<page>
>
>
> FOR 100,000 BYTES TRAINED ON:
> The girl was sitting on they can confunce (non--&gt;, with this surelCatd,
> mak.]
> The characteristics set maki.org/ Poccurs in the [[M.
>
> It act Lam, ''unism==
> {{main|150px|[[hu:Anarchism]]
> [[sl:space="preserve">#REDIRECT [[Fory/fEDIRECT [[Afrom the [[Max
> Stirner]], but be givities}}
>
> ==The [[scienti. The authoritarian ar impain when he overl legration that
> if regoing (189898952</id>
> </contributor>
> </contributor>
> <username>Ams</username>
> <id>15898948</username>Ams</username>Josed. of nexchange example, the
> first manifests t893>A�xinitially preferentify the many ecles|[[Chich ce
> 19999|Wizely understand me>
> <id>7543</id>
> </contributor>
> <minor />
> <contributor>
> <ip>Conversion script</ip>
> <namespace key="1">Talk</namespace>
>
>
> FOR 1,000,000 BYTES TRAINED ON:
> The girl was sitting on [[copper]] or [[Zeno &quot;repudiated the
> omnipotence of 0
> | align=&quot;right&quot; assumedia.org: The [[bar (lawk=��q.f|Melillage
> of 14, Andre plays. Par-TV Jaskirport<Plts for its variants from by
> Shrugged imperiod of Atlas Shrugged|section]] 152.
>
> ==San Sebastian Minese: 陳��M.�ju.jpg|thumb|left|Statue of Ayn Rand]]
> [[gl:Astrongly replicated by one.
>
> E5*t)#REdoct, rather pervasive death in tre">{|20010
> |90 MHz took him for deity asks for in the South Pacific]]'' (glor
> accumulated &quot;The Book)]], [[Alfreducation system is afa)
> * [[PurgBifferency_code=197,�на]]
> [[an:Austria]]
> [[als:Archeologie]]
> [[ru:Арія (крия]]
> [[zh-min-nan:Oscar Chióng]]
> [[da:Austria (geography of reconstruction:Oscar Christians appeared
> somethings said to have travel taken from 1
> |Colorado]]. The lowere click, On said to have been effective
> values.&quot; | 60 Metallurgy]]) [[twe_oxaxU.S. state]] Science]]s while
> that Redge talleged|sections]] 121 and 161.
>
> ==BC]]]]
> {{main|Anarchow has energy university of Povertyle [[Tih
>
> [[Hollywood]] was interesting
>
>
> Code use: place code.py, input1.txt, input2.txt, and compressed.txt in one
> folder. Run the code at the desired length of data to eat; at the bottom
> it tells you how many bytes the data compressed to. Then switch the input
> to input2.txt, shorten the run length (e.g. from 10,000 to 9,980), run
> again, and check decompressed.txt. For generation mode, toggle the word
> "generate" in the decode section; if it is on, you simply run the code for
> e.g. 100,000 steps, and if the file is 10,000 letters long with your
> prompt at the bottom, it will see the end of the file and start extending
> decompressed.txt by 90,000 letters.
>
>
> How it works: a tree stores all the 16-, 15-, etc. letter-long strings as
> it runs over the dataset; exact matches are stored as counts. Before
> encoding each letter, I search the tree for the last 16, 15, etc. letters
> seen, take the longest match found, and look at which child letters were
> seen to follow it and their counts. I divide each predicted letter's count
> by the total counts for that layer to get normalized probabilities, e.g.
> 0.4 for 'a' and 0.6 for 'b', so they add up to 1.0.
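> Roughly, in Python (a sketch of the idea only; the class and method names
> are illustrative, not the actual Star Engine code):
>
>     from collections import defaultdict
>
>     MAX_ORDER = 16  # longest context length tracked
>
>     class ContextTree:
>         def __init__(self):
>             # counts[context] maps each letter seen after that context
>             # to how many times it followed
>             self.counts = defaultdict(lambda: defaultdict(int))
>
>         def update(self, history, next_letter):
>             # record next_letter under every context length 0..MAX_ORDER
>             for n in range(min(MAX_ORDER, len(history)) + 1):
>                 context = history[len(history) - n:] if n > 0 else ''
>                 self.counts[context][next_letter] += 1
>
>         def layer(self, history, n):
>             # normalized probabilities for the order-n layer, summing to 1.0
>             context = history[len(history) - n:] if n > 0 else ''
>             followers = self.counts.get(context)
>             if not followers:
>                 return {}
>             total = sum(followers.values())
>             return {c: k / total for c, k in followers.items()}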
> Long matches are more accurate but have fewer counts, so that layer only
> gets part of the weight. And if I have only 30 total counts but only 3
> distinct next letters were seen to follow, then I am more sure I know the
> distribution, so the layer gets more weight: I compute
> lengthOfPredictionSet * 7 as the roof of counts wanted before being
> confident, then divide the counts seen by that roof to get the percentage
> of weight this layer gets. If it gets 30%, I have 70% left to find in
> shorter-context matches. I also give each layer a partially hardcoded
> static weight, since I must not have cracked the formula yet. The lowest
> layer is the no-context set of predictions, simply how common each letter
> is overall.
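> A sketch of that mixing step, continuing the ContextTree sketch above
> (the static_weight table and the fallback to the no-context layer are
> assumptions; the *7 roof is as described):
>
>     def mix_layers(tree, history, static_weight):
>         # static_weight[n]: hand-tuned share cap for the order-n layer
>         mixed = defaultdict(float)
>         remaining = 1.0
>         for n in range(MAX_ORDER, 0, -1):      # longest context first
>             preds = tree.layer(history, n)
>             if not preds:
>                 continue
>             context = history[len(history) - n:]
>             counts = sum(tree.counts[context].values())
>             roof = len(preds) * 7              # counts wanted to be confident
>             share = remaining * min(1.0, counts / roof) * static_weight[n]
>             for letter, p in preds.items():
>                 mixed[letter] += share * p
>             remaining -= share
>         # whatever weight is left goes to the no-context distribution
>         for letter, p in tree.layer(history, 0).items():
>             mixed[letter] += remaining * p
>         return mixed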
> I apply an exponential function to the layer predictions and to the
> blended layers, so it pools its thinking: if a prediction is 0.6% then
> it's probably really 7.2%, and if 9.9%, probably 9.4%; the same at the
> other end, near 0.1. Energy is used for recency: if I'm mixing layer 8 at
> the moment, then I check the last 300 letters for the latest occurrences
> of the current 8 letters and make a temporary set of the predictions that
> follow them; in this set I give more recent matches more count. And if I
> just saw 'p' 1 or 2 letters ago, then I don't predict 'p' as much, but I
> do a lot after ~3 letters.
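> A rough sketch of that recency pass, assuming a 1/(1+age) decay over the
> 300-letter window and a 0.5 penalty for just-seen letters (the exact
> decay and penalty are guesses, not the post's numbers):
>
>     def recency_layer(history, n, window=300):
>         # count what followed recent occurrences of the current
>         # order-n context, weighting newer matches more
>         context = history[-n:]
>         recent = history[-window:]
>         counts = defaultdict(float)
>         for i in range(len(recent) - n):
>             if recent[i:i + n] == context:
>                 age = len(recent) - (i + n)    # 0 = just seen
>                 counts[recent[i + n]] += 1.0 / (1.0 + age)
>         # damp letters that appeared only 1-2 letters ago
>         for dist in (1, 2):
>             if len(history) >= dist and history[-dist] in counts:
>                 counts[history[-dist]] *= 0.5
>         total = sum(counts.values())
>         return {c: k / total for c, k in counts.items()} if total else {}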
> For compression, I take my final set of predictions and subtract them from
> a high of 1.0 until I find the prediction I'd need to remake the file; the
> better my AI predicts the next letter, the less it costs to steer it to
> the correct prediction to remake the file. Once I've subtracted from the
> high 1.0, I also have a low 0.0, and the space before the last
> subtraction, e.g. 0.7 to 0.65, becomes my new high and low. Repeating this
> gives me a very long number, e.g. 0.8456346856.... As I build the number I
> carry away and store the locked digits, e.g. high 0.[763]73 and low
> 0.[763]112; at the end I store just one number that is in between. This
> long number is converted to binary, and then is supposed to be packed into
> letters, e.g. 5Ge8$9&(gf@3Nfy. An extra slot is kept in each prediction
> set in case any unseen letter needs to be steered to. Decompression uses
> nearly the same code as compression.
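> That shrinking high/low interval is arithmetic coding, as the reply above
> notes. A bare-bones decimal sketch of the encode loop (a real coder works
> in binary and renormalizes rather than keeping one long float, which
> loses precision after roughly 15 letters; the escape slot is omitted):
>
>     def arithmetic_encode(message, predict):
>         # predict(history) -> {letter: probability}, summing to 1.0
>         low, high = 0.0, 1.0
>         for i, letter in enumerate(message):
>             preds = predict(message[:i])
>             span = high - low
>             cursor = high
>             # subtract prediction widths from the top until we hit
>             # the letter actually needed to remake the file
>             for c in sorted(preds):      # any fixed order works, as
>                 width = span * preds[c]  # long as the decoder matches
>                 if c == letter:
>                     low, high = cursor - width, cursor
>                     break
>                 cursor -= width
>         return (low + high) / 2  # any number in the final interval decodes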
