On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
Hi Jorn,



I have modified the toString() method in TokenSample.java as given below. This is to 
add a <SPLIT> tag before the token 's.

This let me train a tokenizer model that splits, e.g., "it's" into the two tokens "it" 
and "'s", while the detokenizer rule for single quotes (the same as the rule for double quotes) 
still splits the quotes from expressions enclosed in a pair of single quotes.

This does not handle other uses of the single quote (e.g. don't, can't, etc., and 
names like O'Conner).

I had a look at the change. The tokenization information must be provided to TokenSample; the class
just encapsulates that knowledge. It is not TokenSample's responsibility to figure out how
things should be tokenized.

In your case I think you can just add "'s" to your detokenizer dictionary like this:
<entry operation="MOVE_LEFT">
  <token>'s</token>
</entry>

Doesn't that fix your issue?
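
To illustrate what the MOVE_LEFT entry is meant to achieve, here is a minimal, self-contained sketch. It is not OpenNLP code: the class MoveLeftSketch and the method render are hypothetical names, and the only rule modelled is "attach a MOVE_LEFT token to the token on its left". Under that assumption, the same token array yields both the detokenized text and a training line in which <SPLIT> marks token boundaries that have no whitespace.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this class is not part of OpenNLP.
class MoveLeftSketch {

    // Joins tokens with single spaces, except that a token contained in
    // moveLeft is attached directly to the preceding token. The joiner
    // string is written at that seam: "" gives plain detokenized text,
    // "<SPLIT>" gives a tokenizer training line.
    static String render(List<String> tokens, Set<String> moveLeft, String joiner) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (i > 0) {
                out.append(moveLeft.contains(token) ? joiner : " ");
            }
            out.append(token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("it", "'s", "fine", ".");
        Set<String> moveLeft = Set.of("'s", ".");
        System.out.println(render(tokens, moveLeft, ""));        // it's fine.
        System.out.println(render(tokens, moveLeft, "<SPLIT>")); // it<SPLIT>'s fine<SPLIT>.
    }
}
```

The point of the sketch is that the split decision comes from the token array plus the dictionary, so toString() never has to guess where 's begins. Tokens such as don't or O'Conner stay whole because they arrive as single tokens.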

Jörn
