On Thu, Aug 4, 2011 at 7:58 AM, Jörn Kottmann <[email protected]> wrote:
> Hi William,
>
> I saw your change to the alpha num optimization in the
> tokenizer.
>
> I am aware of the fact that it is not perfect currently, especially
> for non-English languages. In my opinion we should use Unicode
> to determine what is a letter and what is a number.
>
> Since it is a performance optimization, I think we should
> undo the change you made and rather look into the Unicode approach.
>
> What do you think?

+1, but I don't know much about the Unicode approach.

> We might want more options anyway, e.g. a tokenization dictionary for
> some frequent cases. In such a dictionary the tokenizer could look up how
> a certain input char sequence should be tokenized.

Yes. The F score of the models I create using the OpenNLP tokenizer is high (>99%), but it still fails in some cases, probably because my training data doesn't contain enough of them. I added the abbreviation dictionary, but it has not helped much.
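
If the Unicode approach just means using Java's Character class instead of hard-coded ASCII ranges, something like this rough sketch might be what you have in mind (isUnicodeAlphanumeric is only an illustrative name, not existing OpenNLP code):

// Rough sketch: classify a token using Unicode properties
// via java.lang.Character instead of ASCII-only checks.
static boolean isUnicodeAlphanumeric(CharSequence cs) {
  if (cs.length() == 0) {
    return false;
  }
  for (int i = 0; i < cs.length(); ) {
    // Work on code points so characters outside the BMP
    // (surrogate pairs) are classified correctly.
    int cp = Character.codePointAt(cs, i);
    if (!Character.isLetter(cp) && !Character.isDigit(cp)) {
      return false;
    }
    i += Character.charCount(cp);
  }
  return true;
}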
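For the tokenization dictionary idea, I imagine something as simple as a map from a frequent surface form to its pre-split tokens, which the tokenizer would consult before falling back to the model. A rough sketch, with made-up names:

import java.util.HashMap;
import java.util.Map;

// Rough sketch: maps a frequent char sequence to its token split,
// e.g. "can't" -> {"ca", "n't"}.
class TokenizationDictionary {

  private final Map<String, String[]> entries =
      new HashMap<String, String[]>();

  void put(String surface, String... tokens) {
    entries.put(surface, tokens);
  }

  // Returns the stored split, or null if the sequence is unknown.
  String[] lookup(String surface) {
    return entries.get(surface);
  }
}

That would let us fix the frequent failure cases directly instead of hoping the training data covers them.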
