On 08/02/12 22:48, Jörn Kottmann wrote:
In OpenNLP, the tokenization at training time and at run time
must be identical; otherwise performance degrades.
In your case the data is whitespace-tokenized during training
but tokenized with the English maxent tokenizer at run time.
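To make the mismatch concrete, here is a minimal, self-contained sketch (plain Java, no OpenNLP dependency; the maxent output shown in the comment is illustrative of what OpenNLP's English model typically produces). The point is that the two schemes disagree on split points, not on whitespace inside tokens:

```java
import java.util.Arrays;

public class TokenizationMismatch {
    // Whitespace tokenization: split on runs of whitespace. Tokens never
    // carry leading or trailing spaces; punctuation stays glued to words.
    static String[] whitespaceTokenize(String text) {
        return text.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "Folic acid, 5mg.";
        // Training-time view (whitespace tokenizer):
        System.out.println(Arrays.toString(whitespaceTokenize(sentence)));
        // → [Folic, acid,, 5mg.]
        // A statistical tokenizer such as OpenNLP's English maxent model
        // would instead split punctuation off, roughly:
        //   [Folic, acid, ,, 5mg, .]
        // Neither scheme produces tokens containing spaces; they differ
        // only in where the token boundaries fall.
    }
}
```

If the model was trained on the first tokenization but fed the second at run time (or vice versa), the features seen at prediction time never match those seen at training time, which is why accuracy drops.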
OK, so you mean I should train my own tokenizer that returns
tokens like " Folic " rather than "Folic"? How on earth can I do that? I
did try a week ago to train my own tokenizer, but I got exactly the same
results as the pretrained one! I don't understand how I can make a
tokenizer that includes spaces... tokens must NOT include leading
and trailing spaces, am I right?
Jim