Hi there,

I can't read Clojure.
To simplify your code, you should create a Java function wrapping everything
you need from OpenNLP.
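For example, such a wrapper could look roughly like this (just a sketch; I am
assuming the standard pre-trained en-sent.bin and en-token.bin model files,
and "en-ner-drug.bin" as a placeholder name for your custom drug model):

    import java.io.FileInputStream;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.Span;

    public class NameFinderWrapper {

        // Load the models once and reuse them; only the *ME instances
        // must not be shared across threads.
        private final SentenceDetectorME sentenceDetector;
        private final TokenizerME tokenizer;
        private final NameFinderME nameFinder;

        public NameFinderWrapper() throws Exception {
            sentenceDetector = new SentenceDetectorME(
                    new SentenceModel(new FileInputStream("en-sent.bin")));
            tokenizer = new TokenizerME(
                    new TokenizerModel(new FileInputStream("en-token.bin")));
            nameFinder = new NameFinderME(   // model file name assumed
                    new TokenNameFinderModel(new FileInputStream("en-ner-drug.bin")));
        }

        // Sentence-detect, tokenize, then run the name finder per sentence.
        public void findNames(String text) {
            for (String sentence : sentenceDetector.sentDetect(text)) {
                String[] tokens = tokenizer.tokenize(sentence);
                Span[] names = nameFinder.find(tokens);
                for (String name : Span.spansToStrings(names, tokens)) {
                    System.out.println(name);
                }
            }
            nameFinder.clearAdaptiveData(); // reset after each document
        }
    }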
Ok, no problem... The thing is, if the problem were Java interoperability
(passing the wrong data structures to Java) I would expect exceptions, not
something that half-works! On top of that, I would never contact the forum
over mere exceptions; I've spent my fair share of time sorting those out so
the program compiles and runs exception-free... I'm not saying it is
impossible, just highly unlikely, because Java is strongly typed and would
complain at the slightest misconfiguration!

IMPORTANT: OpenNLP tools are not thread safe! Check whether Clojure is
accessing the tool from multiple threads; that would lead to unpredictable
behavior.
Nothing concurrent so far... maybe in the future, but not for now.
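For future reference, though: since the tools are not thread safe, I suppose
the usual trick is one instance per thread, e.g. via a ThreadLocal. A sketch,
assuming an already-loaded TokenizerModel (the models themselves are
immutable and safe to share, unlike the *ME classes):

    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class PerThreadTokenizer {
        private final TokenizerModel model;
        // One TokenizerME per thread; the model is shared safely.
        private final ThreadLocal<TokenizerME> tokenizer =
                new ThreadLocal<TokenizerME>() {
                    @Override protected TokenizerME initialValue() {
                        return new TokenizerME(model);
                    }
                };

        public PerThreadTokenizer(TokenizerModel model) {
            this.model = model;
        }

        public String[] tokenize(String sentence) {
            return tokenizer.get().tokenize(sentence);
        }
    }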

I can't understand what you mean by that. The Tokenizer returns an array of
tokens that you can use as input to the name finder. The Tokenizer will
always split tokens on whitespace, and will decide, using the Maxent
framework, whether it should split in other cases. For example,
"Mr. Jim is here." =>  [Mr.] [Jim] [is] [here] [.]... the only places where
the algorithm had to decide whether to split were the dot in "Mr." and the
dot in "here.".
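In API terms that example is just this (a sketch, assuming the pre-trained
en-token.bin model file):

    import java.io.FileInputStream;

    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TokenizeExample {
        public static void main(String[] args) throws Exception {
            TokenizerModel model =
                    new TokenizerModel(new FileInputStream("en-token.bin"));
            TokenizerME tokenizer = new TokenizerME(model);
            // Prints: Mr. / Jim / is / here / .
            for (String token : tokenizer.tokenize("Mr. Jim is here.")) {
                System.out.println(token);
            }
        }
    }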

I'm with you, man; I couldn't understand it either when I read it... If you
follow the thread you'll see that a couple of messages back this is exactly
what was suggested to me! That is why I tried training my own tokenizer, but
it turns out I get exactly the same results, so I'm back to the pre-trained
one that comes with OpenNLP...

The API works. Probably the issue is related to how you are using it. I
would check that:
  - the Clojure code is using the right Java data structures (see the
    sketch below for the expected types);
  - you are not calling OpenNLP modules from multiple threads.
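Concretely, the "right data structures" are plain strings and string arrays
at every step. A minimal sketch (model loading omitted):

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.util.Span;

    public class PipelineTypes {
        // sentence detector: String   -> String[]
        // tokenizer:         String   -> String[]
        // name finder:       String[] -> Span[]
        static String[] namesInFirstSentence(SentenceDetectorME sd,
                TokenizerME tok, NameFinderME finder, String text) {
            String[] sentences = sd.sentDetect(text);
            String[] tokens = tok.tokenize(sentences[0]);
            Span[] names = finder.find(tokens);
            return Span.spansToStrings(names, tokens);
        }
    }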

Ok, so you're saying the command-line tool does nothing more than the API,
is that right? And does the command-line tool use the pre-trained tokenizer
and sentence detector?
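My guess, from how the docs describe the input format, is that it does
neither: it reads one sentence per line and splits on whitespace only,
roughly the equivalent of this sketch (correct me if the whitespace part
is wrong):

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.tokenize.WhitespaceTokenizer;
    import opennlp.tools.util.Span;

    public class CliEquivalent {
        public static void main(String[] args) throws Exception {
            NameFinderME finder = new NameFinderME(
                    new TokenNameFinderModel(new FileInputStream(args[0])));
            BufferedReader in =
                    new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                // One sentence per line, whitespace tokens only:
                // no sentence detector, no maxent tokenizer.
                String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(line);
                for (String name
                        : Span.spansToStrings(finder.find(tokens), tokens)) {
                    System.out.println(name);
                }
            }
        }
    }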

Regards,
Jim


On 11/02/12 02:25, william.co...@gmail.com wrote:
Hi,

On Fri, Feb 10, 2012 at 9:42 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

  Can you post the piece of code where you use the API so we can check if it
is OK?

That is absolutely fine, as long as you can read Clojure code... Basically
the function in question is very simple. Both the "tokenize" and the
"get-sentences" functions have been checked and work properly; in fact I am
using them in my dictionary lookup as well. "get-sentences" returns a vector
of sentences and "tokenize" returns a vector of tokens. Both use the
pre-trained maxent models. Here it is:

(defn find-names-model [text]
  (map #(drug-find (nlp/tokenize %))
       (nlp/get-sentences text)))

In OOP terms -->  for each sentence in text:
                      drug-find(tokenize(sentence));
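And in case the Clojure syntax is the obstacle, here is a rough, purely
hypothetical Java transliteration of the function above, where getSentences,
tokenize and drugFind are stand-ins for the wrapped OpenNLP calls:

    // Hypothetical Java equivalent of find-names-model; the three
    // helper methods stand in for the wrapped OpenNLP calls.
    static void findNamesModel(String text) {
        for (String sentence : getSentences(text)) {
            drugFind(tokenize(sentence));
        }
    }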

I can't read Clojure.
To simplify your code, you should create a Java function wrapping everything
you need from OpenNLP.
IMPORTANT: OpenNLP tools are not thread safe! Check whether Clojure is
accessing the tool from multiple threads; that would lead to unpredictable
behavior.



  To create an analyzer that can process texts, you should use the
SentenceDetector, the Tokenizer and finally the Name Finder.
Make sure you are creating the data structures correctly. Refer to the
documentation to learn the input and output of each module of the pipe.

This is exactly what I am doing... However, Joern said that I'm separating
entity tokens with whitespace in my training data (e.g. <START:drug>
whatever <END>) but I'm using the pre-trained maxent tokenizer at runtime,
which does not separate tokens by whitespace!

I can't understand what you mean by that. The Tokenizer returns an array of
tokens that you can use as input to the name finder. The Tokenizer will
always split tokens on whitespace, and will decide, using the Maxent
framework, whether it should split in other cases. For example,
"Mr. Jim is here." =>  [Mr.] [Jim] [is] [here] [.]... the only places where
the algorithm had to decide whether to split were the dot in "Mr." and the
dot in "here.".



Of course, inside the SGML tags there is no choice but to include spaces
(you get exceptions otherwise), so in principle what Joern said is
unavoidable!
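To make that concrete with a made-up example: suppose a training line
contains

    ... treated with <START:drug> 5 - fluorouracil <END> daily ...

so the entity tokens are whitespace-separated, but at runtime the
pre-trained maxent tokenizer might keep "5-fluorouracil" as a single token.
The name finder would then be fed a token sequence it never saw in training,
and would find nothing.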


  Yes, you should try to use the sentence detector and tokenizer models
distributed by OpenNLP.

  I thought so, because that is when I get the best results. Even without
multi-word entities, at least it's finding something! However, it really
frustrates me that I can get the desired behaviour from the command line
but not from the API... I've spent a couple of days preparing the data just
to confirm that OpenNLP can identify multi-word entities, and it turns out
it can, but only from the command line! What is the command line doing
differently? Presumably either sentence detection or tokenization, because
neither of those needs to be performed when using the command-line tool.
Apart from these two, which I am doing explicitly in my program, I am using
exactly the same data...

The API works. Probably the issue is related to how you are using it. I
would check that:
  - the Clojure code is using the right Java data structures;
  - you are not calling OpenNLP modules from multiple threads.

Also, you could get the source code from SVN and add some tracing to debug
what OpenNLP receives from the Clojure code.
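Even without touching the sources, a small shim on the Java side would show
exactly what OpenNLP receives. A sketch:

    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.util.Span;

    public class TracingFinder {
        // Log the exact token array handed over by the Clojure code,
        // then delegate to the real name finder.
        public static Span[] find(NameFinderME finder, String[] tokens) {
            System.err.println("tokens from Clojure: "
                    + Arrays.toString(tokens));
            return finder.find(tokens);
        }
    }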

William

