On 3/4/11 1:06 PM, Rohana Rajapakse wrote:
Hi Jorn,
I have modified the toString() method in TokenSample.java as given below, so that
it adds a <SPLIT> tag before the token 's.
This let me train a tokenizer model that splits, e.g., "it's" into the two tokens
"it" and "'s". At the same time, a detokenizer rule (the same as the rule for
double quotes) splits the single quotes off expressions that are enclosed in a
pair of single quotes.
This does not handle other uses of the single quote (e.g. don't, can't, and
names like O'Conner).

I had a look at the change. The tokenization information must be provided
to TokenSample; the class then just encapsulates that knowledge. It is not
TokenSample's responsibility to figure out how things should be tokenized.
In your case I think you can just add "'s" to your detokenizer
dictionary like this:
<entry operation="MOVE_LEFT">
  <token>'s</token>
</entry>
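
As a rough illustration of the effect such an entry should have on the training
data, here is a minimal sketch. It is not OpenNLP code: the class
MoveLeftSketch, the method toSampleString and the hard-coded MOVE_LEFT set are
made up for this example and only stand in for the dictionary lookup. The
caller supplies the tokens; the rule only decides whether a token boundary is
written as a space or as <SPLIT>.

```java
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
public class MoveLeftSketch {

    // Stands in for the MOVE_LEFT entries of a detokenizer dictionary:
    // these tokens are attached to the token on their left.
    static final Set<String> MOVE_LEFT = Set.of("'s");

    // Renders already-tokenized text as one training line. Ordinary token
    // boundaries become a space; a boundary in front of a MOVE_LEFT token
    // becomes <SPLIT>, because there is no whitespace in the original text.
    static String toSampleString(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (i > 0) {
                sb.append(MOVE_LEFT.contains(token) ? "<SPLIT>" : " ");
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: it<SPLIT>'s late
        System.out.println(toSampleString(List.of("it", "'s", "late")));
    }
}
```

Note that the tokens ("it", "'s") come from the caller, and the sample only
records where the boundaries are. Cases such as don't or O'Conner would need
their own tokens and entries.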
Doesn't that fix your issue?
Jörn