On 08/02/12 22:48, Jörn Kottmann wrote:
In OpenNLP the tokenization at training time and at run time must be
identical; otherwise performance degrades. In your case the data is
whitespace-tokenized during training but tokenized with the English
maxent tokenizer at run time.
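To make that mismatch concrete, here is a minimal sketch (the sample sentence and the en-token.bin path are placeholders, not taken from this thread) showing how the two tokenizers can split the same text differently:

    import java.io.FileInputStream;

    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;

    public class TokenizerMismatch {
        public static void main(String[] args) throws Exception {
            String sentence = "Folic Acid 5mg Tablets, 28 tablets.";

            // What the name finder saw at training time: plain whitespace splits,
            // so "Tablets," and "tablets." stay glued to their punctuation.
            String[] trainingStyle = WhitespaceTokenizer.INSTANCE.tokenize(sentence);

            // What it gets at run time: the pre-trained English maxent tokenizer,
            // which typically separates the comma and the final period into
            // their own tokens.
            TokenizerModel model = new TokenizerModel(new FileInputStream("en-token.bin"));
            Tokenizer maxent = new TokenizerME(model);
            String[] runtimeStyle = maxent.tokenize(sentence);

            // The two token sequences differ, so the features the name finder
            // learned during training no longer line up with what it sees here.
        }
    }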

OK, so you mean that I should train my own tokenizer which will return tokens as " Folic " rather than "Folic"? How on earth can I do that? I did try a week ago to train my own tokenizer, but I got exactly the same results as with the pre-trained one! I don't understand how I can make a tokenizer that will include spaces... Tokens must NOT include leading and trailing spaces, am I right?

Jim
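
For reference, training a custom tokenizer with the OpenNLP API looks roughly like the sketch below. The file names are placeholders, the train(...) signature shown is the 1.5.x one (newer releases use a TokenizerFactory instead), and the training data is one sentence per line with <SPLIT> tags marking token boundaries that do not coincide with whitespace. The tokens a trained tokenizer returns never carry leading or trailing spaces.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainTokenizer {
        public static void main(String[] args) throws Exception {
            // Training data: one sentence per line, with <SPLIT> marking token
            // boundaries that are not already whitespace (file name is a placeholder).
            ObjectStream<String> lines =
                new PlainTextByLineStream(new FileInputStream("en-token.train"), "UTF-8");
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            // OpenNLP 1.5.x signature; the last flag enables the alphanumeric
            // optimization (purely alphanumeric tokens are not considered for splitting).
            TokenizerModel model = TokenizerME.train("en", samples, true);

            model.serialize(new FileOutputStream("my-token.bin"));
        }
    }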
