In the NER training format, all tokens are separated by whitespace, so you always need a space between two tokens.
Try our command line tokenizer; it will output whitespace-tokenized text.

Jörn

On Thu, Feb 9, 2012 at 10:10 AM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:
> On 08/02/12 22:48, Jörn Kottmann wrote:
>> In OpenNLP the tokenization during training time and execution
>> time must be identical. Otherwise the performance goes down.
>> In your case it is whitespace tokenized during training
>> and tokenized with the English maxent tokenizer during run time.
>
> Ok, so you mean that I should train my own tokenizer which will return
> tokens as " Folic " rather than "Folic"? How on earth can I do that? I did
> try a week ago to train my own tokenizer, but I got exactly the same
> results as the pretrained one! I don't understand how I can make a
> tokenizer that will include spaces... Tokens must NOT include leading and
> trailing spaces, am I right?
>
> Jim
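A minimal sketch of Jörn's point, in Python rather than OpenNLP's Java API: in the whitespace-tokenized training format, spaces only *separate* tokens, so the tokens themselves never carry leading or trailing spaces. The `<START:drug>`/`<END>` tag name and the sentence below are made-up illustrations, not from the original thread.

```python
# Hypothetical one-line NER training sample in OpenNLP's <START> ... <END>
# marker style (the "drug" tag and the sentence are invented for illustration).
line = "<START:drug> Folic Acid <END> is a B vitamin ."

# Whitespace tokenization: split on runs of whitespace, exactly as the
# training data expects. At run time the same splitting must be used,
# otherwise the model sees tokens it was never trained on.
tokens = line.split()

# Note that "Folic" comes out with no surrounding spaces -- whitespace is
# only the separator, so a token like " Folic " never occurs.
print(tokens)
```

So Jim's instinct is right: tokens must not include leading or trailing spaces; the requirement is only that train-time and run-time tokenization produce identical token boundaries.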