Tokenizer issue - Quotation marks

Rohana Rajapakse Tue, 22 Feb 2011 05:19:10 -0800

Hi,



I am using OpenNLP-1.5 to tokenize text. I tried the text: The army had
made "mistakes". The tokenizer gives me "mistakes as a token (note that
the opening quote is part of the token). But if I change the word
mistakes to Mistakes (i.e. capital M) in the input text, then I get the
token Mistakes (correctly).



Is anyone aware of this issue, and does anyone have an idea of how to
work around it?
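
One possible workaround, given here only as a rough sketch, is to pad
every double quote with spaces before the text reaches the tokenizer, so
that the quote can no longer attach to the following word. The class
QuotePadding and its methods are hypothetical names and are not part of
OpenNLP. The whitespace split stands in for the real tokenizer so that
the example runs on its own. It is an assumption, not something verified
against OpenNLP-1.5, that the real tokenizer would also keep the padded
quotes separate.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
final class QuotePadding {

    // Surround every straight double quote with spaces so that a
    // downstream tokenizer sees it as separate from the adjacent word.
    static String padQuotes(String text) {
        return text.replace("\"", " \" ");
    }

    // Stand-in for a real tokenizer: split on runs of whitespace only.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "The army had made \"mistakes\".";
        // Prints: [The, army, had, made, ", mistakes, ", .]
        System.out.println(whitespaceTokens(padQuotes(input)));
    }
}
```

Note that padding changes the text, so any character offsets reported
by the tokenizer would refer to the padded string and not to the
original input.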



Thanks



Rohana











