Hi,

On Fri, Feb 10, 2012 at 9:42 PM, Jim - FooBar(); <jimpil1...@gmail.com> wrote:

>> Can you post the piece of code where you use the API so we can check if it
>> is OK?
>>
> That is absolutely fine, as long as you can read Clojure code...
> Basically, the function in question is very simple. Both the "tokenize" and
> the "get-sentences" functions have been checked and work properly; in
> fact, I am using them in my dictionary lookup as well. "get-sentences"
> returns a vector of sentences and "tokenize" returns a vector of tokens.
> Both use the pre-trained maxent models. Here it is:
>
> (defn find-names-model [text]
>   (map #(drug-find (nlp/tokenize %))
>        (nlp/get-sentences text)))
>
> In OOP terms --> for each sentence in text:
>                    drug-find(tokenize(sentence));


I can't read Clojure.
To simplify your code, you could create a Java function wrapping everything
you need from OpenNLP.
IMPORTANT: the OpenNLP tools are not thread safe! Check whether the Clojure
code is accessing the tools from multiple threads; that would lead to
unpredictable behavior.
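A minimal sketch of such a wrapper, using the OpenNLP 1.5 API (the class
name DrugFinder and the drug name-finder model are just illustrative; you
supply your own model streams, e.g. from the files distributed by OpenNLP
and your own trained name-finder model):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

// NOT thread safe: create one instance per thread.
public class DrugFinder {
    private final SentenceDetectorME sentenceDetector;
    private final TokenizerME tokenizer;
    private final NameFinderME nameFinder;

    public DrugFinder(InputStream sentModel, InputStream tokenModel,
                      InputStream nameModel) throws IOException {
        sentenceDetector = new SentenceDetectorME(new SentenceModel(sentModel));
        tokenizer = new TokenizerME(new TokenizerModel(tokenModel));
        nameFinder = new NameFinderME(new TokenNameFinderModel(nameModel));
    }

    /** Runs sentence detection, tokenization and name finding over one
     *  document and returns the entity strings that were found. */
    public List<String> findDrugs(String document) {
        List<String> found = new ArrayList<String>();
        for (String sentence : sentenceDetector.sentDetect(document)) {
            String[] tokens = tokenizer.tokenize(sentence);
            Span[] names = nameFinder.find(tokens);
            for (String name : Span.spansToStrings(names, tokens)) {
                found.add(name);
            }
        }
        nameFinder.clearAdaptiveData(); // reset adaptive features per document
        return found;
    }
}
```

Then from Clojure you only need to call one method, findDrugs, with a plain
String, and the data structures on the Java side are guaranteed to be right.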


>
>
>> To create an analyzer that can process texts, you should use the
>> SentenceDetector, the Tokenizer and finally the Name Finder.
>> Make sure you are creating the data structures correctly. Refer to the
>> documentation to learn the input and output of each module of the pipe.
>>
> This is exactly what I am doing... However, Joern said that I'm separating
> entity tokens with whitespace in my training data (e.g. <START:drug>
> whatever <END>) but I'm using the pre-trained maxent tokenizer at runtime,
> which does not separate tokens by whitespace!

I can't understand what you mean by that. The Tokenizer returns an array of
tokens that you can use as input to the name finder. The Tokenizer will
always split tokens on whitespace, and will use the Maxent framework to
decide whether it should also split in other cases. For example, "Mr. Jim is
here." => [Mr.] [Jim] [is] [here] [.] ... the only places where the
algorithm had to decide whether or not to split were the dot in "Mr." and
the dot in "here.".
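To illustrate those two decision points, here is a toy tokenizer. This is
not OpenNLP's actual algorithm: the real TokenizerME makes the
split/no-split decision statistically from the model, while here a
hardcoded abbreviation list stands in for that decision:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyTokenizer {
    // Stand-in for the statistical decision the maxent tokenizer makes:
    // a trailing "." stays attached only for known abbreviations.
    private static final Set<String> ABBREV =
            new HashSet<String>(Arrays.asList("Mr.", "Mrs.", "Dr."));

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<String>();
        for (String t : sentence.split("\\s+")) {   // whitespace always splits
            if (t.length() > 1 && t.endsWith(".") && !ABBREV.contains(t)) {
                tokens.add(t.substring(0, t.length() - 1)); // split the dot off
                tokens.add(".");
            } else {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Mr. Jim is here."));
        // -> [Mr., Jim, is, here, .]
    }
}
```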



> Of course, in the SGML tag there is no other choice but to include spaces
> (it throws exceptions otherwise), so in principle what Joern said is
> unavoidable!
>
>
>> Yes, you should try to use the sentence detector and tokenizer models
>> distributed by OpenNLP.
>>
>
> I thought so, because that is when I get the best results. Even without
> multi-word entities, at least it's finding something! However, it really
> frustrates me that I can get the desired behaviour from the command line
> but not from the API... I've spent a couple of days preparing the data
> just to confirm that OpenNLP can identify multi-word entities, and it
> turns out it can, but only from the command line! What is the command line
> doing differently? Presumably either sentence detection or tokenization,
> because neither of those needs to be performed explicitly when using the
> command-line tool. Apart from these two, which I am doing explicitly in my
> program, I am using exactly the same data...
>

The API works. The issue is probably related to how you are using it. I
would check that:
 - the Clojure code is passing the right Java data structures;
 - the OpenNLP modules are not being called from multiple threads.

You could also get the source code from SVN and add some tracing to debug
what OpenNLP receives from the Clojure code.
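If the code does run on multiple threads (Clojure makes agents and futures
very easy to reach for), one option is to keep one tool instance per
thread. A sketch of that pattern, using java.text.SimpleDateFormat (a
well-known non-thread-safe class) as a stand-in for an OpenNLP tool such
as TokenizerME or NameFinderME, which you would wrap the same way:

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.TimeZone;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerThreadTools {
    // ThreadLocal gives every worker thread its own private instance,
    // so the non-thread-safe object is never shared between threads.
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override protected SimpleDateFormat initialValue() {
                    SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
                    f.setTimeZone(TimeZone.getTimeZone("UTC"));
                    return f;
                }
            };

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<String>> results = new ArrayList<Future<String>>();
        for (int i = 0; i < 8; i++) {
            results.add(pool.submit(new Callable<String>() {
                public String call() {
                    // Each thread formats with its own instance;
                    // no synchronization needed.
                    return FORMAT.get().format(new Date(0L));
                }
            }));
        }
        for (Future<String> f : results) {
            System.out.println(f.get()); // "1970-01-01" from every task
        }
        pool.shutdown();
    }
}
```

The alternative is to synchronize every call to a shared instance, but
per-thread instances avoid the contention entirely.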

William
