Hi Jorn,
I have modified the toString() method in TokenSample.java as shown below, so
that it adds a <SPLIT> tag before the token 's.
This let me train a tokenizer model that splits, e.g., "it's" into the two
tokens "it" and "'s". At the same time, the detokenizer rule for single quotes
(the same as the rule for double quotes) still splits the quotes from
expressions enclosed in a pair of single quotes.
This does not handle other uses of the single quote (e.g. don't, can't, and
names like O'Conner).
I am not sure whether this change affects other functionality in OpenNLP
(where else is the TokenSample class used?) or whether this was the right
place to make it.
Please let me know what you think!
Regards
Rohana
public String toString() {
  StringBuilder sentence = new StringBuilder();
  int lastEndIndex = -1;
  for (Span token : tokenSpans) {
    if (lastEndIndex != -1) {
      // If there are no chars between last token
      // and this token insert the separator chars
      // otherwise insert a space
      String separator = "";
      if (lastEndIndex == token.getStart())
        separator = separatorChars;
      else {
        separator = " ";
        // New condition: add <SPLIT> before 's in the training file
        // when creating/converting conll03 to produce tokenizer
        // training data using "TokenizerConverter"
        if (token.getCoveredText(text).equals("'s")) {
          separator = separatorChars;
        }
      }
      sentence.append(separator);
    }
    sentence.append(token.getCoveredText(text));
    lastEndIndex = token.getEnd();
  }
  return sentence.toString();
}
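
To make the effect easier to see without the rest of OpenNLP, here is a rough,
self-contained sketch of the same separator logic. It is not the actual
TokenSample code: the class name SplitMarkerSketch, the render() method, the
int[]{start, end} span representation and the ATTACHED set are all made up for
illustration. The set also assumes that the other attached tokens ("'t",
"n't") would be handled the same way as "'s", which I have not tried.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in for the modified TokenSample.toString() logic.
final class SplitMarkerSketch {

    static final String SPLIT = "<SPLIT>";

    // Hypothetical list of tokens that should be attached to the previous
    // token; only "'s" is covered by the change above.
    static final Set<String> ATTACHED =
            new HashSet<>(Arrays.asList("'s", "'t", "n't"));

    // Each span is int[]{start, end} (end exclusive), in text order.
    static String render(String text, List<int[]> spans) {
        StringBuilder sentence = new StringBuilder();
        int lastEnd = -1;
        for (int[] span : spans) {
            String token = text.substring(span[0], span[1]);
            if (lastEnd != -1) {
                // No chars between the tokens, or an attached token:
                // write the split marker, otherwise a plain space.
                boolean adjacent = lastEnd == span[0];
                boolean attached = ATTACHED.contains(token);
                sentence.append(adjacent || attached ? SPLIT : " ");
            }
            sentence.append(token);
            lastEnd = span[1];
        }
        return sentence.toString();
    }
}
```

For the text "It 's fine" with spans (0,2), (3,5) and (6,10), render() writes
the marker between "It" and "'s" and a plain space before "fine". Keeping the
attached tokens in a set would make it easy to load them from a file later.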
-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 03 March 2011 16:01
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks
On 3/3/11 4:33 PM, Rohana Rajapakse wrote:
> Thanks. I have got the training files created (conll03 + Reuters) and models
> trained. Used the latin-detokenizer that came with the download. The trained
> model solves the double quotation problem (e.g. "mistakes" now results in
> three tokens: ", mistakes and ").
>
> I have tried adding the same detokenizer rules for the single quote. However,
> they seem to conflict with the other uses of the single quote (e.g.
> possession as Tom's, It's etc.). This means we will have to handle such cases
> separately. I will try adding <SPLIT> tags for those cases (e.g. Tom<SPLIT>'s,
> it<SPLIT>'s etc.). I don't know which gets the priority, the rules in the
> detokenizer or the <SPLIT> tags...
Yes you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.
It would be nice to have such a file as part of the project.
Jörn