Re: Tokenizer issue - Quotation marks

Jörn Kottmann Fri, 04 Mar 2011 04:34:16 -0800

On 3/4/11 1:06 PM, Rohana Rajapakse wrote:

Hi Jorn,




I have modified the toString() method in TokenSample.java as given below. This is to 
add a <SPLIT> token before the token 's.

This helped me train a tokenizer model that splits, e.g., "it's" into two tokens, "it" 
and "'s", while the detokenizer rule (the same as the rule for double quotes) still splits single 
quotes from expressions that are enclosed in a pair of single quotes.

This does not handle other cases of single quotes (e.g. don't, can't, etc., and 
names like O'Conner).

I had a look at the change. The tokenization information must be provided to TokenSample; this class then just encapsulates that knowledge. So it is not its responsibility to figure out how things should be tokenized.

In your case I think you can just add "'s" to your detokenizer dictionary like this:

<entry operation="MOVE_LEFT">
<token>'s</token>
</entry>

Doesn't that fix your issue?
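
To illustrate the intended effect of such an entry, here is a minimal, self-contained sketch. It does not use the OpenNLP API; the class MoveLeftSketch and the method detokenize are hypothetical names, and the sketch assumes that MOVE_LEFT means "attach this token to the preceding token without whitespace".

```java
import java.util.List;
import java.util.Set;

public class MoveLeftSketch {

    // Hypothetical helper, not part of OpenNLP: joins tokens with single
    // spaces, except that any token listed in moveLeftTokens is attached
    // directly to the token before it (the assumed MOVE_LEFT behaviour).
    static String detokenize(List<String> tokens, Set<String> moveLeftTokens) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            // Insert a separating space unless this is the first token
            // or the token is marked to move left.
            if (i > 0 && !moveLeftTokens.contains(token)) {
                text.append(' ');
            }
            text.append(token);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        Set<String> moveLeft = Set.of("'s");
        // prints: it's a test
        System.out.println(detokenize(List.of("it", "'s", "a", "test"), moveLeft));
    }
}
```

With the rule living in the dictionary, the tokens "it" and "'s" are supplied to the sample as data, and the class that holds the sample does not need any special handling for 's.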

Jörn
