Can you post the piece of code where you use the API so we can check if it
is OK?
That is absolutely fine as long as you can read Clojure code...
The function in question is very simple. Both the "tokenize" and "get-sentences" functions have been checked and work properly; in fact I am using them in my dictionary lookup as well. "get-sentences" returns a vector of sentences and "tokenize" returns a vector of tokens. Both use the pre-trained maxent models. Here it is:

(defn find-names-model [text]
  (map #(drug-find (nlp/tokenize %))
       (nlp/get-sentences text)))

In OOP terms:

    for each sentence in text:
        drug-find(tokenize(sentence));
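For context, "tokenize" and "get-sentences" are built with the standard clojure-opennlp constructors, roughly like this (a sketch; the model paths are placeholders for my actual setup):

    (require '[opennlp.nlp
               :refer [make-sentence-detector make-tokenizer make-name-finder]])

    ;; pre-trained maxent models distributed with OpenNLP (placeholder paths)
    (def get-sentences (make-sentence-detector "models/en-sent.bin"))
    (def tokenize (make-tokenizer "models/en-token.bin"))

    ;; drug-find wraps the name-finder model I trained myself (placeholder path)
    (def drug-find (make-name-finder "models/en-ner-drug.bin"))

Calling (get-sentences text) returns a vector of sentence strings, (tokenize sentence) a vector of tokens, and (drug-find tokens) the recognised names.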

To create an analyzer that can process texts, you should use the
SentenceDetector, the Tokenizer and finally the Name Finder.
Make sure you are creating the data structures correctly. Refer to the
documentation to learn the input and output of each module of the pipeline.
This is exactly what I am doing... However, Joern said that I'm separating entity tokens with whitespace in my training data (e.g. <START:drug> whatever <END>) while at runtime I'm using the pre-trained maxent tokenizer, which does not separate tokens by whitespace! Of course, inside the SGML tags there is no choice but to include the spaces (it throws exceptions otherwise), so in principle what Joern described is unavoidable!
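Just to make that concrete, a line of the training data looks like this (a made-up example sentence; every token, including punctuation, is separated by whitespace, and the <START:drug>/<END> tags are themselves set off by spaces):

    Patients received <START:drug> folic acid <END> and a placebo once daily .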

Yes, you should try to use the sentence detector and tokenizer models
distributed by OpenNLP.

I thought so, because that is when I get the best results. Even without multi-word entities, at least it finds something! However, it really frustrates me that I can get the desired behaviour from the command line but not from the API... I've spent a couple of days preparing the data just to confirm that OpenNLP can identify multi-word entities, and it turns out it can, but only from the command line! What is the command line doing differently? Presumably either sentence detection or tokenization, since neither of those has to be performed separately when using the command line tool. Apart from these two steps, which I am doing explicitly in my program, I am using exactly the same data...
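One thing I suppose I could try: if the command line tool really just expects whitespace-separated tokens, then whitespace-splitting the detected sentences, instead of running the maxent tokenizer, ought to mimic it. Something along these lines (an untested sketch):

    (require '[clojure.string :as str])

    (defn find-names-ws [text]
      ;; split on whitespace instead of using the maxent tokenizer,
      ;; to match the whitespace-separated training data
      (map #(drug-find (str/split % #"\s+"))
           (nlp/get-sentences text)))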

Thanks for taking the time...

Jim


On 10/02/12 23:11, william.co...@gmail.com wrote:
Can you post the piece of code where you use the API so we can check if it
is OK?

To create an analyzer that can process texts, you should use the
SentenceDetector, the Tokenizer and finally the Name Finder.
Make sure you are creating the data structures correctly. Refer to the
documentation to learn the input and output of each module of the pipeline.

Yes, you should try to use the sentence detector and tokenizer models
distributed by OpenNLP.


On Fri, Feb 10, 2012 at 7:53 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

On 09/02/12 09:21, Joern Kottmann wrote:

In the NER training format all tokens are separated by white spaces.
So you always need a space between two tokens.

Try our command line tokenizer; it will output whitespace-tokenized text.

Hi there Joern,

I followed your suggestion, and it turns out that by using the command line
tool I am indeed able to recognise multi-word tokens! This is all I've
been hoping for, honestly...

HOWEVER, I was not able to reproduce this in my program no matter what!
You see,

  * I devoted a whole day to properly merging 3 papers so that I would
    have some training material.
  * I used the command line tool to train a name finder on that
    training data.
  * I then used the command line tool again, passing it the model I had
    just trained and 4 sentences that contained some obvious drug names.
  * It came back with correct annotations, which means it did recognise
    everything, even folic acid! This is good news...
  * I then tried to load that same model in my program (see the sketch
    below) and passed it the exact same sentences, but now it comes back
    with single-word entities. Two of them may be parts of the actual
    ones ("folic" is part of "folic acid"), but why am I not getting the
    full names?

Now, you said earlier that I should not be using the pre-trained maxent
English tokenizer because it does not add spaces to the tokens. I have
tried surrounding the tokens with spaces before the name finder sees them,
but that is obviously wrong, because then I'm not getting anything back at
all! Not even the single-word entities, or just "folic"... Also, what you
said made me wonder...

If tokenization is so important between training and runtime, then what
tokenizer does the command line tool use? Does it not use the pre-trained
English one when you query a model with some sentences? If entities in the
training data for the name finder HAVE TO be surrounded by spaces,
wouldn't it make sense for the pre-trained tokenizer to do the same? And
vice versa: since the English tokenizer does not include spaces, would it
not make more sense to NOT have spaces in the training data as well, so
the two can cooperate?

Using the command line tool I did not have to do any tokenization, so
presumably it happens internally...
What tokenizer is being used?
Why can I not reproduce the results from my program even though I'm using
the right model?
Just as a reference, I shall mention that I'm using a buffered reader to
read the lines from the file that contains the 4 sentences. The 4 sentences
just happen to be on separate lines, but the sentence detector can cope
with that...
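Concretely, the reading code amounts to something like this (a reconstructed sketch; the file name is a placeholder):

    (require '[clojure.java.io :as io])

    ;; read the test file line by line and run the pipeline over it;
    ;; doall forces the lazy result before the reader is closed
    (with-open [rdr (io/reader "test-sentences.txt")]
      (doall
        (map #(drug-find (nlp/tokenize %))
             (mapcat nlp/get-sentences (line-seq rdr)))))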

Thanks in advance,
Jim


