On 09/02/12 09:21, Joern Kottmann wrote:
In the NER training format all tokens are separated by white spaces.
So you always need a space between two tokens.

Try our command line tokenizer; it will output whitespace-tokenized text.
Hi there Joern,

I followed your suggestion, and it turns out that by using the command line tool I am indeed able to recognise multi-word names!!! This is all I've been hoping for, honestly...

HOWEVER, I was not able to reproduce it in my program no matter what!!!
You see,

 * I devoted a whole day to properly merging 3 papers so I would have
   some training material.
 * I used the command line tool to train a nameFinder model with that
   training data (exact commands sketched right after this list, from
   memory).
 * I then used the command line tool again, passing it the model I had
   just trained and 4 sentences that contained some obvious drug names,
 * and it came back with correct annotations, which means it did
   recognise everything. Even Folic acid!!! This is good news....
 * I then tried to load that same model in my program (also sketched
   below) and pass it the exact same sentences, but now it comes back
   with single-word entities. 2 of them may be part of the actual ones
   ("folic" is part of "folic acid"), but why am I not getting the full
   names?

Now, you said last time that I should not be using the pre-trained maxent English tokenizer because it does not add spaces to the tokens. I have tried surrounding the tokens with spaces before the nameFinder sees them, but that is obviously wrong because I'm not getting anything back at all!!! Not even the single-word entities, or just "folic"... Also, what you said made me wonder...

If tokenization is so important between training and runtime, then what tokenizer does the command line tool use? Does it not use the pre-trained English one when you query a model with some sentences? If entities in the training data for the nameFinder HAVE TO be surrounded by spaces, wouldn't it make sense for the pre-trained tokenizer to do the same? And vice versa: since the English tokenizer does not include spaces, would it not make more sense to NOT have spaces in the training data as well, so the two can co-operate?
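
Just so we are looking at the same thing, a line in my training data
looks roughly like this (the "drug" type name and the sentence itself
are from my own data, with every token, punctuation included, separated
by a space):

    Patients received 5 mg of <START:drug> folic acid <END> once daily .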

Using the command line tool I did not have to do any tokenization, so presumably it happens internally...
What tokenizer is being used?
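If the answer is plain whitespace splitting, I guess I could match it
on my side with something like this (just my guess from browsing the
API docs, not something you confirmed):

    import opennlp.tools.tokenize.WhitespaceTokenizer;

    public class WhitespaceDemo {
        public static void main(String[] args) {
            // split on spaces only, exactly the way the training data is split
            String[] tokens = WhitespaceTokenizer.INSTANCE
                    .tokenize("He was given folic acid .");
            for (String t : tokens) {
                System.out.println(t);
            }
        }
    }
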
Why can I not reproduce the results from my program even though I'm using the right model? Just as a reference, I shall mention that I'm using a BufferedReader to read the lines from the file that contains the 4 sentences (sketch below). The 4 sentences just happen to be on separate lines, but the sentence detector can cope with that...
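
For completeness, the reading loop is essentially this sketch (the file
names are placeholders):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileReader;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class ReadSentencesDemo {
        public static void main(String[] args) throws Exception {
            SentenceDetectorME sentenceDetector = new SentenceDetectorME(
                    new SentenceModel(new FileInputStream("en-sent.bin")));

            BufferedReader reader =
                    new BufferedReader(new FileReader("sentences.txt"));
            String line;
            while ((line = reader.readLine()) != null) {
                // each line happens to hold exactly one sentence, but the
                // sentence detector runs over it anyway
                for (String sentence : sentenceDetector.sentDetect(line)) {
                    // tokenize + nameFinder.find(...) happens here
                }
            }
            reader.close();
        }
    }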

Thanks in advance,
Jim
