I don't think you can prove Perplexity is better than lossless compression for AI prediction evaluation. The Hutter Prize and Matt Mahoney's Large Text Compression Benchmark have fewer scores to compare against, but they tell you your score more firmly. I asked yesterday how Perplexity works, and its issue isn't just test/validation-set data leaking; it has another issue that is an actual problem, unlike the leaking one, which may not be a big problem or may not usually happen to you. In Perplexity you take the average prediction accuracy over every item in the test set, fine, but then you normalize by the total count of tokens in the test set so the score is dataset-size invariant. That doesn't seem great, though it may be OK, like the possible data-leak issue. Also, the fact that I predict 1-3 letters while they predict tokens means my average has more samples, and a letter is easier to predict than a word (much harder for them), and my normalization would be totally different too, so I CAN'T compete unless my Byte Pair Encoder is the same. I use Byron Knoll's cmix pre-processor, which is close; maybe it is BPE. Still have to check.
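A minimal sketch of the normalization point above (the numbers are made up for illustration): perplexity is averaged per *token*, so a word-level model and a letter-level model scoring the exact same text with the exact same total probability get very different perplexities, while a compression-style bits-per-character score comes out identical because it divides total code length by character count, not token count.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log p_i)), normalized by token count N,
    which makes the score depend on the tokenizer."""
    return math.exp(-sum(log_probs) / len(log_probs))

def bits_per_character(log_probs, num_chars):
    """Compression-style score: total code length in bits divided by
    the number of *characters*, independent of the tokenizer."""
    total_bits = -sum(lp / math.log(2) for lp in log_probs)
    return total_bits / num_chars

# Hypothetical example: the same 12-character string scored by a
# word-level model (3 tokens) and a letter-level model (12 tokens),
# each assigning the same TOTAL probability to the whole string.
text_len = 12
word_lps = [math.log(0.2)] * 3              # 3 tokens, total prob 0.2^3
char_lps = [math.log(0.2 ** (3 / 12))] * 12  # 12 tokens, same total prob

# Perplexities differ wildly because the per-token average depends
# on how many tokens the tokenizer produced:
print(perplexity(word_lps))  # 5.0
print(perplexity(char_lps))  # ~1.495

# ...but bits-per-character is identical for both, since neither the
# total code length nor the character count depends on tokenization:
print(bits_per_character(word_lps, text_len))
print(bits_per_character(char_lps, text_len))
```

This is why a letter-predictor and a BPE-token predictor can share a compression leaderboard but can't meaningfully share a perplexity leaderboard.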
SO: Perplexity has more scores to compare against than my not-well-known-but-otherwise-best benchmarks (HP/LTCB), yet in practice it also has fewer, because if you work with BPE you can't compare scores against letter-level models (you may have half or none of the scores left to compare to! ouch!). It is not bulletproof.
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T31c4c6495649906f-M33b160490acd421b5623e8e1
Delivery options: https://agi.topicbox.com/groups/agi/subscription
