Oh, yeah... another reason that test is messed up: to keep it easy for 
job interviews, he defined tokens so that a plain `split` would work, but the 
text data itself is full of punctuation. That artificially inflates the "novel 
vocabulary per byte" by something like 3x, and that parameter really matters 
for the histogram sizes, which in turn matter for performance. So a choice made 
to contain complexity ends up making the test unrepresentative for anyone 
optimizing against it.
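To make the mechanism concrete, here's a minimal sketch (my own toy example, not the benchmark in question): with a whitespace-only `split`, punctuation stays glued to the words, so "test", "test," and "test." all count as distinct vocabulary entries, and the histogram of distinct tokens grows much faster per byte than with a punctuation-aware tokenizer.

    # Hypothetical illustration of the inflation effect; the sample text and
    # the regex-based "clean" tokenizer are assumptions, not the original test.
    from collections import Counter
    import re

    text = "Yes, yes. Yes! The test, the test, the test again."

    # Whitespace-only tokenization: punctuation stays attached to words.
    naive_tokens = text.split()

    # A punctuation-aware alternative for comparison.
    clean_tokens = re.findall(r"[A-Za-z']+", text.lower())

    naive_vocab = Counter(naive_tokens)
    clean_vocab = Counter(clean_tokens)

    print(len(naive_vocab), "distinct tokens via split()")          # inflated
    print(len(clean_vocab), "distinct tokens with punctuation stripped")

On real prose the gap is smaller than in this contrived snippet, but the direction is the same: the naive vocabulary count is what feeds the histograms.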
