On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
>  
>
> I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
> made "mistakes".  It gives me "mistakes as a token (note that the opening
> quote is part of the token). But if I change the word mistake to Mistake
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
>  
>
> Is anyone aware of this issue, and any idea how to get around it?
>
>  
>
> Thanks
>
>  
>
> Rohana
I see the issue. Unfortunately, I can't do much to fix it without
the training data used for the tokenizer.  You can use the
SimpleTokenizer instead; it appears to work with your sample.
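
To show why a character-class tokenizer is not affected, here is a rough
sketch in plain Java. It is only an approximation of that style of
tokenizing, not the OpenNLP implementation: the class CharClassTokenizer
and its tokenize method are made-up names for illustration, and the rule
that every non-alphanumeric character becomes its own token is my
assumption. With OpenNLP on the classpath you would call the library's
SimpleTokenizer instead (SimpleTokenizer.INSTANCE.tokenize(text), if I
remember the 1.5 API correctly).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: splits text wherever the
// character class (letter, digit, other) changes and drops whitespace.
class CharClassTokenizer {

    private enum Kind { WHITESPACE, LETTER, DIGIT, OTHER }

    private static Kind kindOf(char c) {
        if (Character.isWhitespace(c)) return Kind.WHITESPACE;
        if (Character.isLetter(c)) return Kind.LETTER;
        if (Character.isDigit(c)) return Kind.DIGIT;
        return Kind.OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        Kind current = Kind.WHITESPACE;
        for (int i = 0; i < text.length(); i++) {
            Kind kind = kindOf(text.charAt(i));
            // A token ends when the class changes; each OTHER character
            // (quotes, periods, ...) is emitted as its own token.
            boolean boundary = kind != current || kind == Kind.OTHER;
            if (boundary) {
                if (current != Kind.WHITESPACE && i > start) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
                current = kind;
            }
        }
        // Flush the last token unless the text ended in whitespace.
        if (current != Kind.WHITESPACE && start < text.length()) {
            tokens.add(text.substring(start));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [The, army, had, made, ", mistakes, ", .]
        System.out.println(tokenize("The army had made \"mistakes\"."));
    }
}
```

Because the split depends only on the character class and not on a trained
model, the opening quote is always separated from the word that follows it.
The trade-off is that it also splits things a trained tokenizer might keep
together, such as abbreviations or decimal numbers.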

I found a few more samples that don't work:
    This model is "bad."
    This is the "year" of the "pig."

The problem seems to occur when a " is followed by certain characters.

James
