Starting in the 1990s, language-modeling papers would lowercase the text, strip punctuation, and replace any word outside the top 20,000 with a symbol for unknown words. Perplexity was then computed as 2^(bits per word) on a held-out test set after training. All of this makes the numbers look better, but you have to use the same methodology to show that your algorithm beats prior work, because otherwise you don't publish, and you perish. Yes, it perverts science.
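To make the convention concrete, here is a minimal sketch of that computation. The probabilities are invented for illustration; the point is only that mapping a rare word to a shared <unk> symbol raises the probability the model assigns at that position, which lowers the reported perplexity without the model getting any better.

```python
import math

def perplexity(word_probs):
    """Perplexity as 2^(bits per word): word_probs is the model's
    probability for each word of the test set."""
    bits_per_word = -sum(math.log2(p) for p in word_probs) / len(word_probs)
    return 2 ** bits_per_word

# Same four-word test set, two preprocessing choices (toy numbers):
raw   = [0.1, 0.02, 0.0005, 0.08]  # rare word kept, gets tiny probability
unked = [0.1, 0.02, 0.05,   0.08]  # rare word replaced by <unk>, higher probability

print(perplexity(raw))    # larger number
print(perplexity(unked))  # smaller, "better"-looking number, same model
```

This is why scores are only comparable when the vocabulary size, casing, punctuation handling, and <unk> policy all match.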
On Tue, Sep 14, 2021, 6:14 AM <[email protected]> wrote:
> Maybe people here know the answer.
>
> After four days of searching Google, emails, Reddit posts, YouTube replies,
> and chat questions, no one has answered.
>
> I tried running other people's code too. You'd have to look at many
> codebases, and they are not small.
>
> https://paperswithcode.com/sota/language-modelling-on-wikitext-2
>
> The link above does not specify whether we predict whole words (always
> separated by a space) or parts of words. Which is the right way? If parts,
> which BPE method do I use? Results are incomparable if I don't predict the
> right things in the right amounts. Predicting letters gives me a perplexity
> of about 2, because letters are easier to predict. And predicting letters
> actually makes prediction worse, but you can't see that unless you use the
> Hutter Prize evaluation.
>
> Do I predict spaces? Commas? Periods? <UNK>? <eos>?
>
> This makes the Hutter Prize and the Large Text Compression Benchmark look
> even more like shining gold compared to perplexity benchmarks. Without
> strict rules, an FAQ, and people who reply, perplexity is a breeding ground
> for papers that claim "my algo got 5 points lower than some other SOTA
> algo" without explaining how they got that score.
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/Tc9a99c50a9ec758e-M1fcaf8f4d2e3a795c0a37494
