I can help you do this with the API.
You should train your own TokenizerModel just like any other model:
TokenizerModel model = TokenizerME.train(language, sampleStream,
useAlphaNumericOptimization, trainingParameters);
In your case, I suggest you write your own TokenSampleStream class:
ObjectStream<TokenSample> sampleStream = new MyTokenSampleStream(...);
This class should of course implement the
ObjectStream<TokenSample> interface, for which you must implement the
following method:
public TokenSample read()
A TokenSample basically has to be filled with:
* String text; // which represents a sentence
* List<Span> tokenSpans; // which is the list of spans into which your
sentence must be tokenized.
e.g. TokenSample t: {
text = "my token sample stream"
tokenSpans = { [0, 8], [9, 22] }
...
}
As you can see, this TokenSample contains two tokens, "my token" and
"sample stream" (the end offset of each span is exclusive).
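To make the offsets concrete, here is a minimal sketch of how a line annotated with <SPLIT> could be turned into the text and span offsets shown above. It is only an illustration under some assumptions: the class name SplitMarkerParser and its Parsed holder are hypothetical, spans are plain int pairs rather than OpenNLP Span objects so the snippet has no dependencies, and whitespace around a marker is assumed to stay in the text but fall outside the tokens.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of OpenNLP. */
final class SplitMarkerParser {
    public static final String MARKER = "<SPLIT>";

    /** Plain text plus [start, end) offsets; end is exclusive. */
    public static final class Parsed {
        public final String text;
        public final List<int[]> spans;

        Parsed(String text, List<int[]> spans) {
            this.text = text;
            this.spans = spans;
        }
    }

    /** Splits only at MARKER; spaces inside a segment never split a token. */
    public static Parsed parse(String annotated) {
        StringBuilder text = new StringBuilder();
        List<int[]> spans = new ArrayList<>();
        int pos = 0;
        while (pos <= annotated.length()) {
            int next = annotated.indexOf(MARKER, pos);
            if (next < 0) {
                next = annotated.length();
            }
            String segment = annotated.substring(pos, next);
            int base = text.length(); // offset of this segment in the plain text
            text.append(segment);     // the marker itself is dropped

            // Assumption: leading/trailing whitespace is not part of the token.
            int start = 0;
            int end = segment.length();
            while (start < end && Character.isWhitespace(segment.charAt(start))) {
                start++;
            }
            while (end > start && Character.isWhitespace(segment.charAt(end - 1))) {
                end--;
            }
            if (start < end) {
                spans.add(new int[] {base + start, base + end});
            }
            pos = next + MARKER.length();
        }
        return new Parsed(text.toString(), spans);
    }
}
```

With these assumptions, parsing "my token <SPLIT>sample stream" gives the text and the two spans from the example above. In real code each int pair would become a Span before being handed to a TokenSample.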
The constructor of MyTokenSampleStream should load the training data
(from a file, from a database, whatever), and on each invocation of the
read method you should return:
* a new TokenSample from your data
* null if you don't have more samples
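The read contract above (next sample, then null once the data is exhausted) can be sketched roughly as follows. To keep the snippet dependency-free it uses a stand-in interface, SampleReader, which is hypothetical and is NOT OpenNLP's ObjectStream; the samples are plain strings and the data is held in memory. A real MyTokenSampleStream would implement ObjectStream<TokenSample> and return TokenSample objects instead.

```java
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the read() contract; not OpenNLP's interface. */
interface SampleReader<T> {
    /** Returns the next sample, or null when there are no more samples. */
    T read();
}

/** Hypothetical in-memory stream; a real one might load from a file or database. */
final class InMemorySampleStream implements SampleReader<String> {
    private final Iterator<String> samples;

    // The constructor takes the already loaded training data.
    public InMemorySampleStream(List<String> data) {
        this.samples = data.iterator();
    }

    @Override
    public String read() {
        // One sample per call, then null to signal the end of the data.
        return samples.hasNext() ? samples.next() : null;
    }

    /** Consumes a reader until null, the way a training loop would. */
    public static <T> int drain(SampleReader<T> reader) {
        int count = 0;
        while (reader.read() != null) {
            count++;
        }
        return count;
    }
}
```

The drain method is only there to show how a consumer relies on the null return to know when to stop.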
TokenizerME.train will read samples from your sampleStream and
train your custom model. You can then save it or use it, depending
on your needs.
Cheers,
Riccardo
On 11/02/2012 18:46, Lee Hinman wrote:
Hey Guys,
I'm trying to train a tokenizer that ignores spaces and only uses <SPLIT> to
determine where to split. I wasn't able to find anything in the javadocs, is this
possible with OpenNLP? If so, could someone point me in the right direction regarding
it?
- Lee Hinman