Hi James,

OK, I'll have a look at the new code, but to be honest I'm not having any problems with the case-sensitivity flag... Also, are you referring to the flag of the Dictionary, the trained NER model, or both?

Sorry to ask the same question again, but how on earth is the dictionary supposed to recognise entries like "Folic acid" when all it sees are the separate tokens "Folic" and "acid"? I can understand that a properly trained model generates features in order to learn which tokens to recognise. These usually include the words before and after the token in question, amongst other things... But the dictionary, which is purely an exhaustive, iterative search based solely on a list of words, only looks at the token in question, so I don't think it is possible to find multi-word entities using the DictionaryNameFinder.
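The only way I can picture a dictionary lookup finding "Folic acid" is if it compares windows of consecutive tokens against whole entries instead of checking one token at a time. Here is a rough sketch of that idea in plain Java. To be clear, this is NOT the DictionaryNameFinder code and I am not claiming openNLP works this way; the class and method names (WindowDictionaryMatcher, addEntry, find) are made up for illustration, and I am assuming the sentence is already tokenized and that matching is case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the OpenNLP implementation.
class WindowDictionaryMatcher {

    // Every dictionary entry is kept as its lower-cased token sequence.
    private final Set<List<String>> entries = new HashSet<>();
    private int maxEntryLength = 0;

    void addEntry(String... tokens) {
        List<String> key = new ArrayList<>();
        for (String t : tokens) {
            key.add(t.toLowerCase(Locale.ROOT));
        }
        entries.add(key);
        maxEntryLength = Math.max(maxEntryLength, tokens.length);
    }

    // Returns {start, end} token index pairs (end exclusive).
    // At each position the longest matching window wins.
    List<int[]> find(String[] tokens) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int matchedEnd = -1;
            int longest = Math.min(maxEntryLength, tokens.length - i);
            for (int len = longest; len >= 1; len--) {
                List<String> window = new ArrayList<>();
                for (int j = i; j < i + len; j++) {
                    window.add(tokens[j].toLowerCase(Locale.ROOT));
                }
                if (entries.contains(window)) {
                    matchedEnd = i + len;
                    break;
                }
            }
            if (matchedEnd > 0) {
                spans.add(new int[] {i, matchedEnd});
                i = matchedEnd; // continue after the matched entity
            } else {
                i++;
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        WindowDictionaryMatcher matcher = new WindowDictionaryMatcher();
        matcher.addEntry("folic", "acid");
        matcher.addEntry("domoic", "acid");
        String[] tokens = {"Patients", "received", "Folic", "acid", "daily"};
        for (int[] span : matcher.find(tokens)) {
            System.out.println(String.join(" ", Arrays.copyOfRange(tokens, span[0], span[1])));
        }
    }
}
```

Note that this only works if the dictionary entry itself is stored as two tokens ("folic", "acid") and not as a single string containing a space, which is exactly the part I am unsure about in the real Dictionary.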

Of course, in my case I've got the same problem with both approaches, which comes down to the tokenization that happens before the dictionary or the maxent model takes over. But at least I can understand how the maxent model can look for multi-word entities, whereas I certainly don't understand how the Dictionary can...

One of my fellow researchers suggested that, as far as the maxent model is concerned, I should look at the confidence probabilities of each token in each sentence and concatenate any tokens with low enough confidence probabilities in order to get my multi-word tokens. Of course that assumes that both "Folic" and "acid" are found by the NER and that they have very similar confidence probabilities. But that is not the case for me... I mean it finds "Folic" and "Domoic" but NOT "acid", which is what follows both folic and domoic... So again, dead end!!!
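To make the suggestion concrete, this is roughly how I read it: join neighbouring tokens that were both found and whose confidence values are close to each other. It is only a sketch in plain Java under my own assumptions; the found/prob arrays, the tolerance parameter and the class name AdjacentTokenMerger are invented for illustration and do not come from openNLP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the merging heuristic; not part of OpenNLP.
class AdjacentTokenMerger {

    // found[i]: whether token i was tagged by the name finder (assumed input).
    // prob[i]:  confidence reported for token i (assumed input).
    // Adjacent found tokens are joined when their confidences differ
    // by at most `tolerance`.
    static List<String> merge(String[] tokens, boolean[] found,
                              double[] prob, double tolerance) {
        List<String> entities = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!found[i]) {
                i++;
                continue;
            }
            StringBuilder entity = new StringBuilder(tokens[i]);
            int j = i + 1;
            while (j < tokens.length && found[j]
                    && Math.abs(prob[j] - prob[j - 1]) <= tolerance) {
                entity.append(' ').append(tokens[j]);
                j++;
            }
            entities.add(entity.toString());
            i = j;
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] tokens = {"Folic", "acid", "helps"};
        double[] prob = {0.9, 0.85, 0.1};

        // Case the heuristic needs: both tokens are found.
        System.out.println(merge(tokens, new boolean[] {true, true, false}, prob, 0.1));

        // My actual situation: "acid" is never found, so nothing gets merged.
        System.out.println(merge(tokens, new boolean[] {true, false, false}, prob, 0.1));
    }
}
```

The second call is my situation: since "acid" is never tagged in the first place, there is nothing next to "Folic" to concatenate.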

Is there any chance you can point me to the person who wrote the official tutorial for openNLP, or the person who actually did the training for the provided person name-finder model? I really want to hear his/her opinion on the matter... My problem is that the tutorial shows how to recognize person names (which are 2-3 words most of the time), but I strongly suspect that this is impossible without doing something extra which the tutorial unfortunately does not show...

By the way, the words I'm choosing to mention are not random... The terms "folic acid" and "domoic acid" appear at least 200 times in the training data! The word "acid" appears at least 1000 times as part of other drug names, but it is still not recognised by the name-finder!!! Not even as a single token, the way "folic" and "domoic" are...

Regards,
Jim

P.S. Have you ever done any serious NER (not for demonstration purposes) using openNLP?




On 08/02/12 01:23, James Kosin wrote:
On 2/7/2012 5:29 AM, Jim - FooBar(); wrote:
Hey James,

First of all thanks for taking the time to reply to my massive e-mail.
I really appreciate it...

Secondly i think you slightly misunderstood my problems..to be fair
it's quite complicated!

Ok, so i'm using openNLP 1.5.2 so i don't think there is any newer
nameFinder code.

Hi again,

Actually, SVN has a later version that fixes a problem with the name
matching over long series.  A bug in the code was causing the case
sensitivity flag to go away when matching other tokens in a series.
This was due to an attempted optimization that was working before we
fixed the case sensitivity issues which broke the optimization.

James
