Problem with openNLP Name Finder API....

Jim - FooBar(); Mon, 06 Feb 2012 09:15:39 -0800

Hello openNLP users/developers,

I really need your help with the openNLP NameFinder... Let me explain:

I am writing a drug-entity recogniser (NER for drugs) and I'm using the openNLP API from Clojure. I don't have annotated text, but I do have an up-to-date dictionary of drugs (drugbank.xml). The dictionary includes all sorts of information, so it needs a bit of preprocessing to extract the names and synonyms. Anyway, my problem is twofold:

As I said, I don't have annotated text, but I thought maybe I can make some. You see, I do have 383 pharmacology papers in raw text, so I thought why not use the names from the dictionary and build regex patterns to replace all occurrences of each entry with the appropriate "<START:drug>/drug-name/ <END>" annotation tag. Now, you may be wondering at this point why on earth I am not using all the names I extracted from drugbank.xml to build a proper openNLP dictionary to do lookup... Well, apart from that not being what I'm trying to do here, I've already tried it and got poor results, simply because names like "Folic acid" are tokenized as 2 tokens rather than 1 and thus the entry "Folic acid" in the dictionary matches no token! Even if it worked, though, I would still have to train a maxent model to recognise drugs that may not exist in the dictionary (brand new ones, for instance).

My initial approach was a bit fiddly but it sort of paid off in the end. I now have a small program that expects some text and, for each entry in the dictionary (6707 in total), finds and replaces any occurrences of that entry in the text with the expected openNLP format for training. Here is where the 1st problem happens. I can tweak my regex pattern to add spaces to the entity tag or not, with the following results:


<START:drug> drug-name <END>  (with spaces inside)

causes problems for me because I get nested tags. To understand why, think about the words "Folic acid" for example. Let's assume that "Folic acid" is entry 3 in the dictionary and that entry 115 is "acid". First time round it will produce <START:drug> Folic acid <END>, but when it processes "acid" it will match the word "acid" that is already tagged as part of "Folic acid". You can see where this is going. If you happen to have a complex compound name, and after a while a slightly less complex compound name (maybe a word shorter or something), and then a smaller one, they can easily start to nest, especially when dealing with drug names. Now the easy and straightforward solution to that is to NOT add spaces in the tag, like this:


<START:drug>Folic acid<END>   (this will NOT match "acid" in later parsing)

I honestly wasn't expecting that to make any difference to the training process, but as it turns out it breaks it completely. Exceptions everywhere before it even starts!!! Could someone please explain what happens with those spaces around the entity name? How on earth can they make any difference? I can solve my problem by using a negative lookbehind assertion in my regex, but that slows things down quite a bit! Remember, I'm dealing with 6707 entries times 383 papers. Clojure's lazy attitude and loop/recur structure sure help a lot...
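One possible way around the nesting without a per-entry lookbehind is to annotate in a single pass instead of one pass per dictionary entry. The Python sketch below is only an illustration under assumptions: the three names stand in for the 6707 drugbank entries, `build_pattern` and `annotate` are hypothetical helpers, and speed on the full dictionary is untested. Entries are sorted longest-first and joined into one alternation, so "Folic acid" is tried before "acid", and because the substitution runs once, text that has already been tagged is never scanned again. The spaces inside the tag are kept.

```python
import re

def build_pattern(names):
    # Longest names first, so "Folic acid" is tried before "acid".
    ordered = sorted(set(names), key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    # (?<!\w) and (?!\w) only enforce whole-word matches; they are
    # evaluated once per candidate position, not once per entry.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)

def annotate(text, pattern, label="drug"):
    # One left-to-right pass: matched text is consumed and never re-scanned,
    # so a shorter entry cannot be tagged inside a longer one.
    return pattern.sub(
        lambda m: "<START:%s> %s <END>" % (label, m.group(0)), text
    )

# Hypothetical stand-ins for the dictionary entries.
names = ["Folic acid", "acid", "Aspirin"]
pattern = build_pattern(names)
annotated = annotate("Patients received folic acid and aspirin daily.", pattern)
print(annotated)
```

The same idea should carry over to Clojure via `re-pattern` and `clojure.string/replace`, since both accept a single compiled pattern and a replacement function.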


Ok now on to the 2nd problem...

Even when I manually sort out all the nested tags and finally train a maxent model on the newly, automatically annotated papers, I still get very poor results (poorer than the dictionary). I think I can see why, but that contradicts the openNLP documentation and all the examples I've seen so far.

In the openNLP tutorial it seems perfectly normal to have 2 words inside a tag, like:


<START:name> Pierre Vinken <END>

but when the time comes to use the name-finder model you just trained, everything has to be tokenized again. Therefore, "Pierre Vinken" becomes 2 tokens and cannot be recognised. Sometimes just the one token may be recognised as an entity, but without the rest of the name that is not only meaningless but could be misleading. Again, think about Folic acid... Neither "folic" nor "acid" is a drug... even if the name-finder recognises "folic", it's of no use!!! In the same way, "Vinken" is not necessarily a name...
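To make the token question concrete, here is a minimal Python sketch, not openNLP code, of how a training line in this format could be read. It assumes the line is split on whitespace and that only a standalone `<START:type>` or `<END>` token counts as a tag; whether the real trainer behaves exactly this way is an assumption worth checking against the openNLP source. `parse_sample` is a hypothetical helper. Under that assumption, a name is stored as a span over token indices, so a two-token name is one span rather than one token, and tags written without surrounding spaces fuse with their neighbours and are no longer seen as tags.

```python
import re

# A tag must be a whole whitespace-delimited token to be recognised.
START = re.compile(r"^<START(?::([^:>\s]+))?>$")
END = "<END>"

def parse_sample(line):
    """Return (tokens, spans); each span is (start, end_exclusive, type)
    measured in token indices, with the tags themselves removed."""
    tokens, spans = [], []
    open_start, open_type = None, None
    for piece in line.split():
        m = START.match(piece)
        if m:
            open_start, open_type = len(tokens), m.group(1) or "default"
        elif piece == END:
            if open_start is None:
                raise ValueError("<END> without a matching <START>")
            spans.append((open_start, len(tokens), open_type))
            open_start = None
        else:
            tokens.append(piece)
    if open_start is not None:
        raise ValueError("<START> without a matching <END>")
    return tokens, spans

spaced = parse_sample("Take <START:drug> Folic acid <END> daily .")
fused = parse_sample("Take <START:drug>Folic acid<END> daily .")
print(spaced)  # one span covering the two tokens "Folic" and "acid"
print(fused)   # no spans: the fused pieces are treated as ordinary tokens
```

If the name finder reports its results the same way, as spans over the token array, then "Folic acid" tokenized as two tokens would still come back as a single two-token entity.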


Anyway, sorry for the massive e-mail, but I'm really struggling!

Please help me, I'm at a dead end at the moment! I've tried literally everything... Am I missing anything important?

Thanks in advance...keep up the good work!

Jim
