Hello OpenNLP users/developers,
I really need your help with the OpenNLP NameFinder. Let me explain:
I am writing a drug-entity recogniser (NER for drugs) and I'm using the
OpenNLP API from Clojure. I don't have annotated text, but I do have an
up-to-date dictionary of drugs (drugbank.xml). The dictionary includes
all sorts of information, so it needs a bit of preprocessing to extract
the names and synonyms. Anyway, my problem is twofold:
As I said, I don't have annotated text, but I thought maybe I could make
some. I do have 383 pharmacology papers in raw text, so I thought: why
not use the names from the dictionary to build regex patterns and
replace every occurrence of an entry with the appropriate "<START:drug>
/drug-name/ <END>" annotation tag? Now, you may be wondering why on
earth I am not using all the names I extracted from drugbank.xml to
build a proper OpenNLP dictionary and do lookup. Well, apart from that
not being what I'm trying to do here, I've already tried it and got poor
results, simply because names like "Folic acid" are tokenized as 2
tokens rather than 1, and thus the entry "Folic acid" in the dictionary
matches no single token! Even if it worked, though, I would still have
to train a maxent model to recognise drugs that may not exist in the
dictionary (brand-new ones, for instance). My initial approach was a bit
fiddly but it sort of paid off in the end. I now have a small program
that takes some text and, for each entry in the dictionary (6707 in
total), finds and replaces every occurrence of that entry in the text
with the format OpenNLP expects for training. Here is where the 1st
problem happens. I can tweak my regex pattern to add spaces inside the
entity tag or not, with the following results:
<START:drug> drug-name <END> (with spaces inside)
causes problems for me because I get nested tags. To understand why,
think about the words "Folic acid", for example. Let's assume that
"Folic acid" is entry 3 in the dictionary and that entry 115 is "acid".
The first time round it will produce <START:drug> Folic acid <END>, but
when it processes "acid" it will match the word "acid" inside the
already tagged "Folic acid". You can see where this is going. If you
have a complex compound name, later a slightly less complex one (maybe
a word shorter), and then a smaller one still, they can easily start to
nest, especially with drug names. Now the easy and straightforward
solution to that is to NOT add spaces in the tag, like this:
<START:drug>Folic acid<END> (this will NOT match "acid" in later parsing)
I honestly wasn't expecting that to make any difference to the training
process, but as it turns out it breaks it completely. Exceptions
everywhere before it even starts! Could someone please explain what
happens with those spaces around the entity name? How on earth can they
make any difference? I can solve my problem with a negative lookbehind
assertion in my regex, but that slows things down quite a bit! Remember,
I'm dealing with 6707 entries times 383 papers. Clojure's laziness
and loop/recur structure sure help a lot...
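
To make the nesting issue concrete, here is a minimal sketch of one
alternative I could imagine: a single pass over the text with all names
in one alternation, longest first, instead of one replace per entry.
It is written in Python purely for illustration (it is not my Clojure
code); the function name `annotate` and the three-entry dictionary are
made up, and I have not measured how this behaves with all 6707 entries.

```python
import re

def annotate(text, names, label="drug"):
    """Tag every dictionary name in a single pass, preferring longer names."""
    # Longest first, so "Folic acid" is tried before "acid" at each position.
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # The lookarounds stop "acid" from matching inside a word like "placid".
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)
    # Each match is consumed once, so text inside a tag is never scanned again.
    return pattern.sub(lambda m: f"<START:{label}> {m.group(0)} <END>", text)

names = ["acid", "Folic acid", "aspirin"]  # hypothetical dictionary entries
print(annotate("Folic acid and aspirin were given.", names))
```

Because the whole text is rewritten in one substitution, the tags keep
their inner spaces and "acid" can no longer match inside an already
tagged "Folic acid".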
OK, now on to the 2nd problem...
Even when I manually sort out all the nested tags and finally train a
maxent model on the newly, automatically annotated papers, I still get
very poor results (poorer than the dictionary). I think I can see why,
but my explanation contradicts the OpenNLP documentation and all the
examples I've seen so far.
In the OpenNLP tutorial it seems perfectly normal to have 2 words inside
a tag, like:
<START:name> Pierre Vinken <END>
but when the time comes to use the name-finder model you just trained,
everything has to be tokenized again. Therefore, "Pierre Vinken" becomes
2 tokens and cannot be recognised. Sometimes just one of the tokens may
be recognised as an entity, but without the rest of the name that is not
only meaningless but could be misleading. Again, think about Folic
acid... Neither "folic" nor "acid" is a drug, so even if the name-finder
recognises "folic", it's of no use! In the same way, "Vinken" is not
necessarily a name...
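
My understanding of the API may well be wrong here. If the finder
reports each entity as a span of token indices (start inclusive, end
exclusive) rather than as single tokens, then a two-token name could in
principle be recovered as sketched below. This is Python for
illustration only; the helper `spans_to_names` and the span values are
invented, not real finder output.

```python
def spans_to_names(tokens, spans):
    """Join the tokens covered by each (start, end) span; end is exclusive."""
    return [" ".join(tokens[start:end]) for start, end in spans]

# Whitespace tokenization of an example sentence.
tokens = "Patients received Folic acid daily .".split()
# Hypothetical result: one entity covering tokens 2 and 3.
spans = [(2, 4)]
print(spans_to_names(tokens, spans))
```

If that is how it works, "Folic acid" would come back as one entity even
though it is 2 tokens, which is exactly what I am failing to get.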
Anyway, sorry for the massive e-mail, but I'm really struggling!
Please help me, I'm at a dead end at the moment! I've tried literally
everything... am I missing anything important?
Thanks in advance...keep up the good work!
JIm