Hi William,

I saw your change to the alphanumeric optimization in the tokenizer.
I am aware that it is not perfect at the moment, especially for non-English languages. In my opinion we should use Unicode to determine what is a letter and what is a digit. Since it is a performance optimization, I think we should undo your change and look into the Unicode approach instead. What do you think?

We might want more options anyway, e.g. a tokenization dictionary for some frequent cases. In such a dictionary the tokenizer could look up how a certain input character sequence should be tokenized.

Jörn
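
P.S.: To make this a bit more concrete, here is a rough sketch of the two ideas, assuming we stay in Java. None of this is code from the tree; the class name, the dictionary, and its entries are made up for illustration. It uses Character.isLetter/Character.isDigit, which follow the Unicode character categories, instead of an ASCII range check, and shows how a small dictionary lookup could sit in front of the model.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch only; the class name and dictionary entries are made up for illustration.
public class TokenizerSketch {

    // Unicode-aware check: Character.isLetter/isDigit follow the Unicode
    // character categories, so letters like 'ä' or 'é' and non-Latin digits
    // are covered, unlike a plain ASCII range check.
    static boolean isAlphaNumeric(CharSequence s) {
        if (s.length() == 0) {
            return false;
        }
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (!Character.isLetter(cp) && !Character.isDigit(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Hypothetical tokenization dictionary: maps a frequent input char
    // sequence directly to its tokenization, so the model is not consulted.
    static final Map<String, String[]> TOKEN_DICT = new HashMap<>();
    static {
        TOKEN_DICT.put("don't", new String[] {"do", "n't"});
    }

    // Returns the dictionary tokenization, or null to fall back to the model.
    static String[] lookup(String span) {
        return TOKEN_DICT.get(span);
    }

    public static void main(String[] args) {
        System.out.println(isAlphaNumeric("Käse42"));  // true with the Unicode check
        System.out.println(isAlphaNumeric("x-y"));     // false, '-' is neither
        System.out.println(Arrays.toString(lookup("don't")));  // [do, n't]
    }
}

A null result from the dictionary lookup would simply mean falling back to the normal tokenizer.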
