On 09/02/12 09:21, Joern Kottmann wrote:
> In the NER training format all tokens are separated by white spaces.
> So you always need a space between two tokens.
> Try our command line tokenizer; it will output whitespace-tokenized text.
Hi there Joern,
I followed your suggestion, and it turns out that by using the command
line tool I am indeed able to recognise multi-word names! This is all
I've been hoping for, honestly...
HOWEVER, I was not able to reproduce it in my program, no matter what!
You see,
* I devoted a whole day to properly merging 3 papers so that I have
  some training material.
* I used the command line tool to train a name finder on that training
  data.
* I then used the command line tool again, passing it the model I had
  just trained and 4 sentences that contained some obvious drug names.
* It came back with correct annotations, which means it recognised
  everything. Even "Folic acid"! This is good news...
* I then tried to load that same model in my program and pass it the
  exact same sentences, but now it comes back with single-word
  entities. 2 of them may be parts of the actual ones ("folic" is part
  of "folic acid"), but why am I not getting the full names? A rough
  sketch of what my program does is below.
Now, in your last message you said that I should not be using the
pre-trained maxent English tokenizer, because it does not add spaces to
the tokens. I have tried surrounding the tokens with spaces before the
name finder sees them, but that is obviously wrong, because I am not
getting anything back at all! Not even the single-word entities, or
just "folic"... Roughly what I tried is sketched below.
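
The space-padding attempt, continuing from the sketch above (same
tokenizer and nameFinder objects), was essentially this:

// my (apparently wrong) attempt: pad every token with spaces
// before handing the array to the name finder
String[] tokens = tokenizer.tokenize(sentence);
for (int i = 0; i < tokens.length; i++) {
    tokens[i] = " " + tokens[i] + " ";
}
Span[] names = nameFinder.find(tokens); // now comes back empty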
Also, what you said made me wonder: if tokenization is so important
between training and runtime, then what tokenizer does the command line
tool use? Does it not use the pre-trained English one when you query a
model with some sentences? If entities in the training data for the
name finder HAVE TO be surrounded by spaces, wouldn't it make sense for
the pre-trained tokenizer to do the same? And vice versa: since the
English tokenizer does not include spaces, would it not make more sense
to NOT have spaces in the training data as well, so the two can
cooperate?
Using the command line tool I did not have to do any tokenization, so
presumably it happens internally...
What tokenizer is being used?
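
Is it just plain whitespace splitting? This is my guess at what the
tool does before calling find() (the class name and sentence here are
mine, just to illustrate):

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceGuess {

    public static void main(String[] args) {
        String sentence = "The patient was prescribed folic acid.";
        // split on whitespace only, matching the training format?
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}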
Why can I not reproduce the results from my program, even though I am
using the right model?
Just as a reference, I shall mention that I am using a BufferedReader
to read the lines from the file that contains the 4 sentences. The 4
sentences just happen to be on separate lines, but the sentence
detector can cope with that...
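
In case it matters, the reading loop is roughly this (the file name is
a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadTestSentences {

    public static void main(String[] args) throws Exception {
        BufferedReader reader =
            new BufferedReader(new FileReader("test-sentences.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            // one sentence per line; the sentence detector re-splits
            // the text anyway before the name finder sees it
            System.out.println(line);
        }
        reader.close();
    }
}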
Thanks in advance,
Jim