RE: Tokenizer issue - Quotation marks

Rohana Rajapakse Fri, 04 Mar 2011 04:07:28 -0800

Hi Jorn,

I have modified the toString() method in TokenSample.java as shown below, so 
that it emits a <SPLIT> tag before the token "'s".

This let me train a tokenizer model that splits, e.g., "it's" into the two 
tokens "it" and "'s", while the detokenizer rule for single quotes (the same as 
the rule for double quotes) still splits single quotes off expressions that are 
enclosed in a pair of single quotes.

This does not handle other uses of the single quote (e.g. don't, can't, and 
names like O'Conner).

I am not sure whether this code change affects other functionality in OpenNLP 
(where else is the TokenSample class used?) or whether this was the right place 
to make it.

Please let me know what you think!

Regards

Rohana

  public String toString() {
    StringBuilder sentence = new StringBuilder();
    int lastEndIndex = -1;

    for (Span token : tokenSpans) {
      if (lastEndIndex != -1) {
        // If there are no chars between the last token and this token,
        // insert the separator chars; otherwise insert a space.
        String separator = "";
        if (lastEndIndex == token.getStart()) {
          separator = separatorChars;
        } else {
          separator = " ";
          // New condition: add <SPLIT> before 's in the training file when
          // converting conll03 to tokenizer training data using
          // "TokenizerConverter".
          if (token.getCoveredText(text).equals("'s")) {
            separator = separatorChars;
          }
        }
        sentence.append(separator);
      }
      sentence.append(token.getCoveredText(text));
      lastEndIndex = token.getEnd();
    }

    return sentence.toString();
  }
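
To see the effect of the new condition in isolation, here is a minimal, 
self-contained sketch of the same loop. It does not use the OpenNLP classes: 
the Span class, the render() method and the SEPARATOR constant are hypothetical 
stand-ins for opennlp.tools.util.Span, TokenSample.toString() and 
separatorChars, and the sample sentence is made up.

```java
import java.util.Arrays;
import java.util.List;

class SplitBeforeClitic {

    /** Hypothetical stand-in for OpenNLP's Span: [start, end) character offsets. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        String coveredText(String text) {
            return text.substring(start, end);
        }
    }

    /** Assumed to play the role of separatorChars in TokenSample. */
    static final String SEPARATOR = "<SPLIT>";

    /** Same logic as the modified toString(), written as a static helper. */
    static String render(String text, List<Span> tokenSpans) {
        StringBuilder sentence = new StringBuilder();
        int lastEndIndex = -1;

        for (Span token : tokenSpans) {
            String covered = token.coveredText(text);
            if (lastEndIndex != -1) {
                // Tokens with no characters between them always get the separator.
                boolean adjacent = lastEndIndex == token.start;
                // New condition: a whitespace-separated "'s" gets it as well.
                boolean attach = adjacent || covered.equals("'s");
                sentence.append(attach ? SEPARATOR : " ");
            }
            sentence.append(covered);
            lastEndIndex = token.end;
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        // Made-up sample in which "'s" is separated from "Tom" by a space.
        String text = "Tom 's car .";
        List<Span> spans = Arrays.asList(
            new Span(0, 3), new Span(4, 6), new Span(7, 10), new Span(11, 12));
        System.out.println(render(text, spans)); // prints: Tom<SPLIT>'s car .
    }
}
```

Only the separator in front of "'s" changes; the other whitespace-separated 
tokens are still joined with a plain space.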

-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: 03 March 2011 16:01
To: [email protected]
Subject: Re: Tokenizer issue - Quotation marks

On 3/3/11 4:33 PM, Rohana Rajapakse wrote:

> Thanks. I have got the training files created (conll03 + Reuters) and models 
> trained. Used the latin-detokenizer that came with the download. The trained 
> model solves the double quotation problem (e.g. "mistakes" now results in 
> three tokens: ", mistakes and ").

>

> I have tried adding the same detokenizer rules for the single quote. However, 
> this seems to conflict with the other uses of the single quote (e.g. 
> possession as in Tom's, or contractions such as It's). This means we will have 
> to handle such cases separately. I will try adding <SPLIT> tags for those cases 
> (e.g. Tom<SPLIT>'s, it<SPLIT>'s, etc.). I don't know which gets priority, the 
> rules in the detokenizer or the <SPLIT> tags...

Yes, you need to add all the tokens which should be attached to the
previous one, like "'s", "'t", etc.

It would be nice to have such a file as part of the project.

Jörn
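
One way to generalize from a single hard-coded "'s" to a list of attached 
tokens is a set lookup. The sketch below is only an illustration: the class and 
method names are hypothetical, and apart from "'s" and "'t" (mentioned above) 
the entries in the set are assumptions about which tokens a corpus might 
contain, not a list taken from OpenNLP.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class AttachedTokens {

    /**
     * Tokens to attach to the previous token. "'s" and "'t" come from the
     * thread; the remaining entries are illustrative assumptions.
     */
    static final Set<String> ATTACHED = new HashSet<>(
        Arrays.asList("'s", "'t", "n't", "'re", "'ve", "'ll", "'d", "'m"));

    /** Assumed to play the role of separatorChars in TokenSample. */
    static final String SEPARATOR = "<SPLIT>";

    static boolean attachesToPrevious(String token) {
        return ATTACHED.contains(token);
    }

    /** Joins whitespace-separated tokens, using <SPLIT> before attached ones. */
    static String join(String[] tokens) {
        StringBuilder sentence = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sentence.append(attachesToPrevious(tokens[i]) ? SEPARATOR : " ");
            }
            sentence.append(tokens[i]);
        }
        return sentence.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"I", "do", "n't", "know", "Tom", "'s", "car"};
        System.out.println(join(tokens)); // prints: I do<SPLIT>n't know Tom<SPLIT>'s car
    }
}
```

In TokenSample.toString(), the equals("'s") check could be replaced by a lookup 
of this kind, with the set loaded from whatever file the project ends up 
providing.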

