Re: Tokenizer issue - Quotation marks

James Kosin Tue, 22 Feb 2011 19:07:30 -0800

On 2/22/2011 8:18 AM, Rohana Rajapakse wrote:
> Hi,
>
> I am using OpenNLP-1.5 to tokenize text. Tried the text The army had
> made "mistakes".  It gives me "mistakes as a token (note starting quote
> is part of the token). But, if I change the word mistakes to Mistakes
> (i.e. capital M) in the input text, then I get the token Mistakes
> (correctly).
>
> Is anyone aware of this issue, and any idea how to work around it?
>
> Thanks
>
> Rohana
I see the issue. Unfortunately, I can't do much about fixing it without
the training data used for the tokenizer.  You can use the
SimpleTokenizer instead, which appears to work with your sample.
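
To illustrate why a rule-based tokenizer is not affected, here is a
minimal, self-contained sketch of character-class tokenization. It only
approximates what SimpleTokenizer does and is not the OpenNLP code; the
class name QuoteTokenizerSketch and its methods are made up for this
example. In OpenNLP itself the equivalent call should be
SimpleTokenizer.INSTANCE.tokenize(text), as far as I know.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of OpenNLP: splits text wherever the
// character class changes, so a quote never sticks to the next word.
class QuoteTokenizerSketch {

    private enum CharClass { WHITESPACE, ALPHABETIC, NUMERIC, OTHER }

    private static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.ALPHABETIC;
        if (Character.isDigit(c)) return CharClass.NUMERIC;
        return CharClass.OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1;                           // start of the current token
        CharClass prevClass = CharClass.WHITESPACE;
        char prevChar = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            CharClass cls = classify(c);

            if (prevClass == CharClass.WHITESPACE) {
                // A token begins at the first non-whitespace character.
                if (cls != CharClass.WHITESPACE) start = i;
            } else if (cls != prevClass
                    || (cls == CharClass.OTHER && c != prevChar)) {
                // Class changed, or two different punctuation characters
                // met: close the current token.
                tokens.add(text.substring(start, i));
                start = (cls == CharClass.WHITESPACE) ? -1 : i;
            }
            prevClass = cls;
            prevChar = c;
        }
        // Flush the last token if the text did not end in whitespace.
        if (prevClass != CharClass.WHITESPACE) tokens.add(text.substring(start));
        return tokens;
    }
}
```

Calling QuoteTokenizerSketch.tokenize on the sample sentence puts each
quotation mark in its own token, regardless of the case of the letter
that follows. The trade-off is that this approach has no notion of
context, so it will also split things like abbreviations and decimal
numbers that a trained model might keep together.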


I found a few more samples that don't work:
    This model is "bad."
    This is the "year" of the "pig."

The problem seems to occur when a " is next to certain characters.

James
