On Thu, Aug 4, 2011 at 7:58 AM, Jörn Kottmann <[email protected]> wrote:
> Hi William,
>
> I saw your change to the alpha num optimization in the
> tokenizer.
>
> I am aware of the fact that it is not perfect currently, especially
> for non-English languages. In my opinion we should use Unicode
> to determine what is a letter and what is a number.
>
> Since it is a performance optimization, I think we should
> undo the change you made and rather look into the Unicode approach.
>
> What do you think?

+1, but I don't know much about the Unicode approach.

> We might want more options anyway, e.g. a tokenization dictionary for
> some frequent cases. In such a dictionary the tokenizer could look up how
> a certain input char sequence should be tokenized.

Yes. The F score of the models I create using the OpenNLP tokenizer is high (>99%), but it still fails in some cases, probably because my training data doesn't contain enough of them. I added the abbreviation dictionary, but it has not helped much.
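
If the Unicode approach just means using Java's Character class instead of hard-coded ASCII ranges, something like this rough sketch might be what you have in mind (isUnicodeAlphanumeric is only an illustrative name, not existing OpenNLP code):

// Rough sketch: classify a token using Unicode properties
// via java.lang.Character instead of ASCII-only checks.
static boolean isUnicodeAlphanumeric(CharSequence cs) {
  if (cs.length() == 0) {
    return false;
  }
  for (int i = 0; i < cs.length(); ) {
    // Work on code points so characters outside the BMP
    // (surrogate pairs) are classified correctly.
    int cp = Character.codePointAt(cs, i);
    if (!Character.isLetter(cp) && !Character.isDigit(cp)) {
      return false;
    }
    i += Character.charCount(cp);
  }
  return true;
}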
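For the tokenization dictionary idea, I imagine something as simple as a map from a frequent surface form to its pre-split tokens, which the tokenizer would consult before falling back to the model. A rough sketch, with made-up names:

import java.util.HashMap;
import java.util.Map;

// Rough sketch: maps a frequent char sequence to its token split,
// e.g. "can't" -> {"ca", "n't"}.
class TokenizationDictionary {

  private final Map<String, String[]> entries =
      new HashMap<String, String[]>();

  void put(String surface, String... tokens) {
    entries.put(surface, tokens);
  }

  // Returns the stored split, or null if the sequence is unknown.
  String[] lookup(String surface) {
    return entries.get(surface);
  }
}

That would let us fix the frequent failure cases directly instead of hoping the training data covers them.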
