On Sun, Mar 22, 2020, 8:57 AM <[email protected]> wrote:
> One more question, Matt:
> https://openai.com/blog/better-language-models/
> They say "enwik8 - bits per character (-) - OURS: 0.93 - LAST RECORD 0.99"
> But... enwik8 is 100 MB! Even 0.99 bpc alone would be roughly 100 MB / 8 = 12.5 MB,
> yet the best real compression is 14.8 MB. What are they doing here? 0.93 bpc is
> 11,625,000 bytes.

I don't know where OpenAI got those numbers. They certainly aren't mine.

Also, perplexity is equivalent to compression. The conversion is perplexity = 2^(bits per word). Some language modelers improve their numbers by removing punctuation and capitalization, splitting words into a root and a suffix, and mapping rare words outside a 20K-word dictionary to a common token. In data compression we consider that cheating.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T2a0cd9d392f9ff94-M31d8f5c1210dc5cd98b32d87
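For anyone who wants to check the arithmetic, here is a quick sketch (function names are my own, not from any benchmark code) converting a bits-per-character rate into an implied compressed size for enwik8, and applying the perplexity = 2^(bits per word) conversion:

```python
# enwik8 is the first 10^8 bytes of English Wikipedia.
ENWIK8_BYTES = 100_000_000

def compressed_size(bpc: float, n_chars: int = ENWIK8_BYTES) -> int:
    """Implied compressed size in bytes for a given bits-per-character rate."""
    return int(bpc * n_chars / 8)

def perplexity(bits_per_word: float) -> float:
    """Perplexity = 2^(bits per word)."""
    return 2.0 ** bits_per_word

print(compressed_size(0.93))  # 11,625,000 bytes, as stated above
print(compressed_size(0.99))  # 12,375,000 bytes, about 12.4 MB
print(perplexity(1.0))        # 2.0
```

Note the gap: the 0.93 bpc figure implies ~11.6 MB, well below the ~14.8 MB achieved by actual compressors on enwik8, which is what makes the reported number suspicious.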
