On 09/02/12 09:21, Joern Kottmann wrote:
> In the NER training format all tokens are separated by white spaces.
> So you always need a space between two tokens.
> Try our command line tokenizer; it will output whitespace-tokenized text.
Hi there Joern,
I followed your suggestion, and it turns out that by using the command
line tool I am indeed able to recognise multi-word names! This is all
I've been hoping for, honestly...
HOWEVER, I was not able to reproduce it in my program, no matter what!
You see,
* I devoted a whole day to properly merging 3 papers so that I have
  some training material.
* I used the command line tool to train a name finder on that training
  data.
* I then used the command line tool again, passing it the model I had
  just trained and 4 sentences that contained some obvious drug names.
* It came back with correct annotations, which means it recognised
  everything. Even "Folic acid"! This is good news...
* I then tried to load that same model in my program and pass it the
  exact same sentences, but now it comes back with single-word
  entities. 2 of them may be parts of the actual ones ("folic" is part
  of "folic acid"), but why am I not getting the full names? A rough
  sketch of what my program does is below.
Now, in your last message you said that I should not be using the
pre-trained maxent English tokenizer, because it does not add spaces to
the tokens. I have tried surrounding the tokens with spaces before the
name finder sees them, but that is obviously wrong, because I am not
getting anything back at all! Not even the single-word entities, or
just "folic"... Roughly what I tried is sketched below.
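
The space-padding attempt, continuing from the sketch above (same
tokenizer and nameFinder objects), was essentially this:

// my (apparently wrong) attempt: pad every token with spaces
// before handing the array to the name finder
String[] tokens = tokenizer.tokenize(sentence);
for (int i = 0; i < tokens.length; i++) {
    tokens[i] = " " + tokens[i] + " ";
}
Span[] names = nameFinder.find(tokens); // now comes back empty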
Also, what you said made me wonder: if tokenization is so important
between training and runtime, then what tokenizer does the command line
tool use? Does it not use the pre-trained English one when you query a
model with some sentences? If entities in the training data for the
name finder HAVE TO be surrounded by spaces, wouldn't it make sense for
the pre-trained tokenizer to do the same? And vice versa: since the
English tokenizer does not include spaces, would it not make more sense
to NOT have spaces in the training data as well, so the two can
cooperate?
Using the command line tool I did not have to do any tokenization, so
presumably it happens internally...
What tokenizer is being used?
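
Is it just plain whitespace splitting? This is my guess at what the
tool does before calling find() (the class name and sentence here are
mine, just to illustrate):

import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceGuess {

    public static void main(String[] args) {
        String sentence = "The patient was prescribed folic acid.";
        // split on whitespace only, matching the training format?
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}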
Why can I not reproduce the results from my program, even though I am
using the right model?
Just as a reference, I shall mention that I am using a BufferedReader
to read the lines from the file that contains the 4 sentences. The 4
sentences just happen to be on separate lines, but the sentence
detector can cope with that...
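
In case it matters, the reading loop is roughly this (the file name is
a placeholder):

import java.io.BufferedReader;
import java.io.FileReader;

public class ReadTestSentences {

    public static void main(String[] args) throws Exception {
        BufferedReader reader =
            new BufferedReader(new FileReader("test-sentences.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            // one sentence per line; the sentence detector re-splits
            // the text anyway before the name finder sees it
            System.out.println(line);
        }
        reader.close();
    }
}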
Thanks in advance,
Jim