I searched a bit and found that the number 0.99 came from here:
https://arxiv.org/abs/1901.02860
also https://encode.ru/threads/3059-How-much-further-can-the-best-compression-go
It actually states 1.06 -> 0.99, and the 1.06 figure came from
https://arxiv.org/abs/1808.04444, which did mention cmix.
I'm not exactly sure whether these are compressed sizes without the model sizes,
though. If so, then after the record is broken a few more times, ignoring the
model size will likely become the new standard.
Maybe the best way to end this would be an algorithm that appears to be
learning but actually memorizes the input exactly?
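
As a rough sketch of the arithmetic behind these bits-per-character figures
(my own illustration, not from any of the papers), assuming enwik8's 100 MB
size and that "including the model size" simply means adding the model or
decompressor bytes to the compressed size, roughly the way the Hutter Prize
counts the decompressor:

    # Convert a compressed size for enwik8 into bits per character,
    # with or without counting the model/decompressor size.
    ENWIK8_BYTES = 100_000_000  # enwik8 is 100 MB of Wikipedia text

    def bits_per_character(compressed_bytes, model_bytes=0):
        # bpc = total bits / number of input bytes (one character per byte)
        return (compressed_bytes + model_bytes) * 8 / ENWIK8_BYTES

    print(bits_per_character(15_250_000))               # ~1.22 bpc, archive alone
    print(bits_per_character(15_250_000, 100_000_000))  # ~9.22 bpc with a hypothetical 100 MB model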

At 2019-02-17 00:32:06, "Matt Mahoney" <[email protected]> wrote:

The paper mentions improving compression of enwik8 from 0.99 to 0.93 bits per
character but gives no details or citation. enwik8 is from my Large Text
Compression Benchmark and is the test file for the Hutter Prize. The current
record is actually 1.22 bits per character, and I haven't received an entry
from them. I am on the prize committee.


text8 is a clean version of enwik8 with only lowercase letters and spaces. 
enwik8 is 100 MB of Wikipedia text with some XML formatting.


On Thu, Feb 14, 2019, 5:28 PM Robert Levy <[email protected]> wrote:

https://blog.openai.com/better-language-models/

Impressive work. They're using the "transformer" technique introduced in the
"Attention Is All You Need" paper. See also:
http://jalammar.github.io/illustrated-transformer/
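
For anyone who hasn't read the paper, here is a minimal numpy sketch of the
scaled dot-product attention at the core of the transformer, i.e.
softmax(Q K^T / sqrt(d_k)) V (a toy illustration, not OpenAI's code):

    import numpy as np

    def attention(Q, K, V):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # query/key similarities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # weighted mix of the values

    # toy example: 3 positions with 4-dimensional queries, keys, and values
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((3, 4)) for _ in range(3))
    print(attention(Q, K, V).shape)  # (3, 4)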
