Re: Training a tokenizer that doesn't tokenize automatically on spaces

Riccardo Tasso Mon, 13 Feb 2012 05:11:38 -0800

I can help you do this with the API.

You should train your own TokenizerModel just like any other model:

TokenizerModel model = TokenizerME.train(language, sampleStream, useAlphaNumericOptimization, trainingParameters);


In your case, I suggest you write your own TokenSample stream class:

ObjectStream<TokenSample> sampleStream = new MyTokenSampleStream(...);

In this class you should of course implement the ObjectStream<TokenSample> interface, for which you must implement the following method:


public TokenSample read()

A TokenSample basically has to be filled with:
* String text; // which represents a sentence

* List<Span> tokenSpans; // which is the list of spans into which your sentence must be tokenized.

e.g. TokenSample t: {
text = "my token sample stream"
tokenSpans = { [0, 8], [9, 22] }
...
}

As you can see, in this TokenSample there are two tokens: "my token" and "sample stream".
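
To make the offsets concrete, here is a minimal sketch in plain Java. It assumes the spans are character offsets whose end is exclusive, which is how OpenNLP's Span behaves. The class name SpanDemo and the int[] pairs are hypothetical stand-ins used only for illustration; they are not OpenNLP types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: shows what text each span covers.
final class SpanDemo {

    /** Returns the substring covered by each [start, end) pair (end exclusive). */
    static List<String> coveredText(String text, int[][] spans) {
        List<String> tokens = new ArrayList<>();
        for (int[] span : spans) {
            tokens.add(text.substring(span[0], span[1]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "my token sample stream";
        int[][] spans = { {0, 8}, {9, 22} };
        System.out.println(coveredText(text, spans));
    }
}
```

The space at offset 8 belongs to neither span, so it is treated as a separator between the two tokens rather than as part of one.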

The constructor of MyTokenSampleStream should load the training data (from a file, from a database... whatever), and on each invocation of the read method you should return:

* a new TokenSample from your data
* null if you don't have more samples
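
For data marked up with <SPLIT>, as in the question below, the loading step might look roughly like this sketch. It assumes that every token boundary in a training line is marked with <SPLIT>, that the marker itself is not part of the sentence text, and that whitespace next to a marker is a separator rather than part of a token. SplitMarkerParser, Span and Sample are hypothetical stand-ins so that the example runs without OpenNLP on the classpath; in a real stream you would build opennlp.tools.util.Span objects and a TokenSample instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
final class SplitMarkerParser {

    /** Stand-in for an OpenNLP span: character offsets, end exclusive. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Stand-in for the text and spans a TokenSample carries. */
    static final class Sample {
        final String text;
        final List<Span> tokenSpans;
        Sample(String text, List<Span> tokenSpans) {
            this.text = text;
            this.tokenSpans = tokenSpans;
        }
    }

    private static final Pattern MARKER = Pattern.compile(Pattern.quote("<SPLIT>"));

    /** Splits one training line on the marker only; spaces inside a piece stay in the token. */
    static Sample parse(String line) {
        StringBuilder text = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (String piece : MARKER.split(line, -1)) {
            int offset = text.length();   // where this piece starts in the marker-free text
            text.append(piece);
            // Exclude whitespace at the edges of the piece from the token span.
            int start = 0;
            int end = piece.length();
            while (start < end && Character.isWhitespace(piece.charAt(start))) start++;
            while (end > start && Character.isWhitespace(piece.charAt(end - 1))) end--;
            if (start < end) {
                spans.add(new Span(offset + start, offset + end));
            }
        }
        return new Sample(text.toString(), spans);
    }
}
```

Calling SplitMarkerParser.parse("my token <SPLIT>sample stream") yields the text "my token sample stream" with the spans [0, 8) and [9, 22), the same sample as in the example above. The read method would then wrap each parsed line in a TokenSample; check the TokenSample constructors in the javadoc of your OpenNLP version for the exact argument types.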

TokenizerME.train will read samples from your sampleStream and train your custom model. Then you can save it or use it, depending on your needs.


Cheers,
    Riccardo

On 11/02/2012 18:46, Lee Hinman wrote:

Hey Guys,

I'm trying to train a tokenizer that ignores spaces and only uses <SPLIT> to
determine where to split. I wasn't able to find anything in the javadocs. Is this
possible with OpenNLP? If so, could someone point me in the right direction?

- Lee Hinman
