In the NER training format, all tokens are separated by whitespace,
so you always need a space between two adjacent tokens.
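A minimal sketch of what that means in practice (the sentence and the tag name `drug` are made-up examples; the `<START:type> ... <END>` markup follows the OpenNLP name-finder training convention):

```python
# One line of whitespace-tokenized name-finder training data.
# The sentence and the "drug" tag are hypothetical examples.
line = "<START:drug> Folic Acid <END> is prescribed for anemia ."

# Because the format is whitespace-separated, a plain split()
# must recover exactly the tokens the tagger will see.
# Note that punctuation is its own token, set off by spaces.
tokens = line.split()
print(tokens)
```

If the punctuation were attached to the previous word (`anemia.`), the trainer would see a different token than the run-time tokenizer produces, which is exactly the mismatch discussed below.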

Try our command line tokenizer; it outputs whitespace-tokenized text.
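The effect of such a tokenizer can be sketched as follows (the regex here is a naive stand-in for illustration only; the real OpenNLP tokenizer uses a rule-based or learned model):

```python
import re

# Raw, untokenized input text (a made-up example sentence).
raw = "Folic acid (vitamin B9) is water-soluble."

# Naive stand-in tokenizer: runs of word characters, or any
# single non-space symbol, become separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", raw)

# Re-joining with single spaces yields whitespace-tokenized
# text, the form the NER trainer expects.
print(" ".join(tokens))
```

Feeding output like this to both training and run time keeps the two tokenizations identical, which is the point made in the quoted reply below.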

Jörn

On Thu, Feb 9, 2012 at 10:10 AM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

> On 08/02/12 22:48, Jörn Kottmann wrote:
>
>> In OpenNLP, the tokenization at training time and at run time
>> must be identical; otherwise performance degrades.
>> In your case the data is whitespace-tokenized during training
>> but tokenized with the English maxent tokenizer at run time.
>>
>
> Ok, so you mean that I should train my own tokenizer which will return
> tokens as " Folic " rather than "Folic"? How on earth can I do that? I did
> try a week ago to train my own tokenizer, but I got exactly the same
> results as the pretrained one! I don't understand how I can make a
> tokenizer that will include spaces... Tokens must NOT include leading and
> trailing spaces, am I right?
>
> Jim
>
