On Sun, Mar 22, 2020, 8:57 AM <[email protected]> wrote:
> One more question, Matt:
> https://openai.com/blog/better-language-models/
> They say "enwik8 - bits per character (-) - OURS: 0.93 - LAST RECORD 0.99"
> But... enwik8 is 100 MB! Even 0.99 bpc alone would be roughly 100 MB / 8 = 12.5 MB,
> yet the best real compression is 14.8 MB. What are they doing here? 0.93 bpc is
> 11,625,000 bytes.

I don't know where OpenAI got those numbers. They certainly aren't mine.

Also, perplexity is equivalent to compression. The conversion is perplexity = 2^(bits per word). Some language modelers improve their numbers by removing punctuation and capitalization, splitting words into a root and a suffix, and mapping rare words outside a 20K-word dictionary to a common token. In data compression we consider that cheating.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T2a0cd9d392f9ff94-M31d8f5c1210dc5cd98b32d87
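For anyone who wants to check the arithmetic, here is a quick sketch (function names are my own, not from any benchmark code) converting a bits-per-character rate into an implied compressed size for enwik8, and applying the perplexity = 2^(bits per word) conversion:

```python
# enwik8 is the first 10^8 bytes of English Wikipedia.
ENWIK8_BYTES = 100_000_000

def compressed_size(bpc: float, n_chars: int = ENWIK8_BYTES) -> int:
    """Implied compressed size in bytes for a given bits-per-character rate."""
    return int(bpc * n_chars / 8)

def perplexity(bits_per_word: float) -> float:
    """Perplexity = 2^(bits per word)."""
    return 2.0 ** bits_per_word

print(compressed_size(0.93))  # 11,625,000 bytes, as stated above
print(compressed_size(0.99))  # 12,375,000 bytes, about 12.4 MB
print(perplexity(1.0))        # 2.0
```

Note the gap: the 0.93 bpc figure implies ~11.6 MB, well below the ~14.8 MB achieved by actual compressors on enwik8, which is what makes the reported number suspicious.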
