Oh, yeah... another reason that test is messed up: to make it easy for job interviews, he defined tokens so that `split` can work, but the text data itself is full of punctuation. Since a plain whitespace split leaves punctuation attached, "word", "word,", and "word." all count as distinct tokens, which artificially inflates the "novel vocabulary per byte" by something like 3x. That parameter really matters for the histogram sizes, which in turn matter for performance. So a choice made to contain complexity also makes the test unrepresentative for optimizers.
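To make the effect concrete, here's a minimal Nim sketch (not the benchmark's code; the input path and the punctuation set are placeholders) that counts distinct tokens two ways: split-only, as the benchmark's token definition allows, versus stripping leading/trailing punctuation first. On prose-like English data the split-only count comes out a few times larger, which is the inflation described above.

```nim
import std/[sets, strutils]

# Placeholder punctuation set; the real data's punctuation mix may differ.
const Punct = {'.', ',', ';', ':', '!', '?', '"', '\'', '(', ')'}

proc vocabSize(text: string; stripPunct: bool): int =
  ## Count distinct lowercased tokens, optionally stripping punctuation
  ## from token edges before counting.
  var seen = initHashSet[string]()
  for tok in text.toLowerAscii.splitWhitespace():
    let w = if stripPunct: tok.strip(chars = Punct) else: tok
    if w.len > 0: seen.incl w
  seen.len

when isMainModule:
  let text = readFile("input.txt")  # placeholder path, not the benchmark file
  echo "split-only vocab:     ", vocabSize(text, stripPunct = false)
  echo "punct-stripped vocab: ", vocabSize(text, stripPunct = true)
```

The ratio of those two numbers is roughly the factor by which the token definition inflates the histogram's key space, and hence its memory footprint and cache behavior.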