Hi James,
OK, I'll have a look at the new code, but to be honest I'm not having any
problems with the case-sensitivity flag. Also, are you referring to the
flag of the Dictionary, of the trained NER model, or both?
Sorry to ask the same question again, but how on earth is the dictionary
supposed to recognise words like "Folic acid" when all it sees are
"Folic" and "acid"? I can understand that a properly trained model has
generated features in order to learn which tokens to recognise. These
usually include the words before and after the token in question, amongst
other things. But the dictionary, which is purely an exhaustive,
iterative search based solely on a list of words, only looks at the
token in question, so I don't think it is possible to find multi-word
entities using the DictionaryNameFinder.
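
To make concrete what I mean: the only way I can see a word list matching
"Folic acid" is if the lookup tries windows of consecutive tokens rather
than one token at a time. Below is a minimal sketch of that idea in plain
Java. It is not OpenNLP's code; the class name, the method name and the
lower-cased, space-joined entry format are all assumptions made up for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not part of OpenNLP.
class TokenWindowLookup {

    // Returns {start, endExclusive} token spans whose tokens, joined by a
    // single space and lower-cased, appear in the entry set.
    static List<int[]> find(String[] tokens, Set<String> entries, int maxTokens) {
        List<int[]> spans = new ArrayList<>();
        int start = 0;
        while (start < tokens.length) {
            int matchedEnd = -1;
            int limit = Math.min(tokens.length, start + maxTokens);
            // Try the longest window first so "folic acid" beats "folic".
            for (int end = limit; end > start; end--) {
                String candidate = String
                        .join(" ", Arrays.copyOfRange(tokens, start, end))
                        .toLowerCase(Locale.ROOT);
                if (entries.contains(candidate)) {
                    matchedEnd = end;
                    break;
                }
            }
            if (matchedEnd > 0) {
                spans.add(new int[] {start, matchedEnd});
                start = matchedEnd;   // continue after the matched entity
            } else {
                start++;              // no entry starts at this token
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        Set<String> entries = new HashSet<>(Arrays.asList("folic acid", "domoic acid"));
        String[] tokens = {"Folic", "acid", "and", "Domoic", "acid", "were", "tested"};
        for (int[] span : find(tokens, entries, 2)) {
            System.out.println(String.join(" ", Arrays.copyOfRange(tokens, span[0], span[1])));
        }
    }
}
```

Whether the DictionaryNameFinder does anything like this internally is
exactly what I am asking.
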
Of course, in my case I've got the same problem with both approaches,
and it comes down to the tokenization that happens before the dictionary
or the maxent model takes over. At least I can understand how the
maxent model can look for multi-word entities, but I certainly don't
understand how the Dictionary can.
One of my fellow researchers suggested that, as far as the maxent model
is concerned, I should look at the confidence probabilities of each
token in each sentence and concatenate any tokens with low enough
confidence probabilities in order to get my multi-word tokens. Of course,
that assumes that both "Folic" and "acid" are found by the NER and
that they have very similar confidence probabilities. But that is not
the case for me. It finds "Folic" and "Domoic" but NOT "acid",
which is what follows both folic and domoic. So again, dead end!
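
For reference, this is roughly how I understood my colleague's suggestion.
It is only a sketch: it assumes the per-token "tagged" flags and
probabilities have already been extracted from the name finder's output,
and the class name, method name and tolerance parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of OpenNLP.
class ConfidenceMerge {

    // Joins runs of adjacent tokens that were all tagged by the name finder
    // and whose probabilities differ from their left neighbour's by at most
    // `tolerance`. Untagged tokens always end a run.
    static List<String> merge(String[] tokens, boolean[] tagged,
                              double[] probs, double tolerance) {
        List<String> entities = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!tagged[i]) {
                i++;
                continue;
            }
            StringBuilder entity = new StringBuilder(tokens[i]);
            int j = i + 1;
            while (j < tokens.length && tagged[j]
                    && Math.abs(probs[j] - probs[j - 1]) <= tolerance) {
                entity.append(' ').append(tokens[j]);
                j++;
            }
            entities.add(entity.toString());
            i = j;
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] tokens = {"Folic", "acid", "helps"};
        // My situation: "Folic" is tagged, "acid" is not.
        boolean[] tagged = {true, false, false};
        double[] probs = {0.9, 0.2, 0.1};
        System.out.println(merge(tokens, tagged, probs, 0.1));
    }
}
```

With "acid" left untagged, all this can return is "Folic" on its own, so
the merging step never gets anything to merge.
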
Is there any chance you can point me to the person who wrote the
official tutorial for OpenNLP, or the person who actually did the
training for the provided person name-finder model? I really want to
hear his/her opinion on the matter. My problem is that the tutorial
shows how to recognize person names (which are 2-3 words most of the
time), but I strongly suspect that this is impossible without doing
something extra which the tutorial unfortunately does not show.
By the way, the words I'm choosing to mention are not random. The terms
"folic acid" and "domoic acid" appear at least 200 times in the training
data! The word "acid" appears at least 1000 times as part of other
drug names, but it is still not recognised by the name-finder! Not even
as a single token, the way "folic" and "domoic" are.
Regards,
Jim
P.S. Have you ever done any serious NER (not for demonstration purposes)
using OpenNLP?
On 08/02/12 01:23, James Kosin wrote:
On 2/7/2012 5:29 AM, Jim - FooBar(); wrote:
Hey James,
First of all thanks for taking the time to reply to my massive e-mail.
I really appreciate it...
Secondly, I think you slightly misunderstood my problems. To be fair,
it's quite complicated!
OK, so I'm using OpenNLP 1.5.2, so I don't think there is any newer
nameFinder code.
Hi again,
Actually, SVN has a later version that fixes a problem with the name
matching over long series. A bug in the code was causing the case
sensitivity flag to go away when matching other tokens in a series.
This was due to an attempted optimization that was working before we
fixed the case sensitivity issues which broke the optimization.
James