Hi Jim,

On 2/6/2012 12:15 PM, Jim - FooBar(); wrote:
> Hello openNLP users/developers,
>
> I really need your help with openNLP NameFinder...Let me explain:
>
> I am writing a drug-entity recogniser (NER for drugs) and i'm using
> the openNLP API from Clojure. I don't have annotated text but i do have
> an up-to-date dictionary of drugs (drugbank.xml). The dictionary includes
> all sorts of information so it needs a bit of preprocessing to extract
> the names and synonyms. Anyway my problem is twofold:
>
> As i said i don't have annotated text, but i thought maybe i can make
> one. You see, i do have 383 pharmacology papers in raw text so i
> thought why not use the names from the dictionary and build regex
> patterns to replace all occurrences of that entry with the appropriate
> "<START:drug> /drug-name/ <END>" annotation tag. Now, you may be
> wondering at this point why on earth am i not using all the names i
> extracted from drugbank.xml to build a proper openNLP dictionary to do
> lookup...Well, apart from not being what i'm trying to do here, i've
> already tried doing that but i got poor results simply because words
> like "Folic acid" are being tokenized as 2 tokens rather than 1 and
> thus, the entry "Folic acid" in the dictionary matches no token! Even
> if it worked though, i would still have to train a maxent model to
> recognise drugs that may not exist in the dictionary (brand new for
> instance). My initial approach was a bit fiddly but it sort of paid off
> at the end. I now have a small program that expects some text and, for
> each entry in the dictionary (6707 in total) it finds and replaces any
> occurrences of that entry in the text with the expected openNLP format
> for training. Here is where the 1st problem happens. I can tweak my
> regex pattern to add spaces to the entity tag or not, with the
> following results :
>
> <START:drug> drug-name <END>  (with spaces inside)
>
> causes problems for me because i get nested tags. To understand why
> think about the words "Folic acid" for example. Let's assume that folic
> acid is entry 3 in the dictionary and that entry 115 is "acid". First
> time round it will produce <START:drug> Folic acid <END> but when it
> processes "acid" it will match the word acid already tagged with
> "Folic acid". You can see where this is going. If you happen to have a
> complex compound name and after a while a slightly less complex
> compound name (maybe a word shorter or something), and then a smaller
> one, they can easily start to nest, especially when dealing with drug
> names. Now the easy and straightforward solution to that is to NOT add
> spaces in the tag like this :
>
> <START:drug>Folic acid<END>   (this will NOT match "acid" in later
> parsing)
>
> I honestly wasn't expecting that to make any difference to the
> training process but as it turns out it breaks it completely.
> Exceptions everywhere before it even starts!!! Could someone please
> explain what happens with those spaces around the entity name? How on
> earth can they make any difference? I can solve my problem by doing
> negative lookbehind assertion in my regex but that slows things down
> quite a bit! Remember i'm dealing with 6707 entries, times 383 papers.
> Clojure's lazy attitude and loop/recur structure sure help a lot...
I'm not sure whether what you are describing is a result of Clojure or a
problem with the dictionary.  Our dictionary should match on and return
the longest match of tokens.  It could, however, be the method you are
using to tokenize the sentences, or the lack thereof.
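
To illustrate what I mean by longest match, here is a minimal Python
sketch. It is not the OpenNLP dictionary code; the function name
tag_longest_match and the two-entry dictionary are made up for the
example, and it assumes the text has already been split into tokens.
Because it works on tokens and always tries the longest entry first,
"acid" can never be tagged again inside "Folic acid".

```python
def tag_longest_match(tokens, entries, tag="drug"):
    """Wrap dictionary matches in <START:tag> ... <END>, longest entry first."""
    # Hypothetical dictionary layout: each entry becomes a tuple of lowercase tokens.
    index = {tuple(e.lower().split()) for e in entries}
    max_len = max((len(e) for e in index), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first so "Folic acid" wins over "acid".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in index:
                # Tags are emitted as separate whitespace-delimited tokens.
                out += [f"<START:{tag}>", *tokens[i:i + n], "<END>"]
                i += n  # skip past the match, so no nested tags are possible
                break
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "Patients received Folic acid daily .".split()
print(tag_longest_match(tokens, ["Folic acid", "acid"]))
# prints: Patients received <START:drug> Folic acid <END> daily .
```

A single pass like this over each sentence also avoids running one regex
per dictionary entry per paper.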

What usually happens is that we first train or use a SentenceDetector to
split the document into sentences, one per line.  Then a Tokenizer
splits each sentence into its individual tokens: words and punctuation
marks.

Example:
    This is a good sentence to "parse" on the way out.
When tokenized:
    This is a good sentence to " parse " on the way out .

The tokenizer splits everything into individual tokens to be parsed by
later models.
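
As a rough stand-in for that step, the Python sketch below reproduces
the example above with a single regular expression. It is only an
approximation for illustration: simple_tokenize is a made-up name, and
the real Tokenizer is either rule based or a trained model, so its
output can differ on harder input.

```python
import re


def simple_tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # A word (optionally with internal hyphens or apostrophes),
    # or any single character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)


print(" ".join(simple_tokenize('This is a good sentence to "parse" on the way out.')))
# prints: This is a good sentence to " parse " on the way out .
```

Note that a name like "Folic acid" comes out as two tokens here, which
is expected: a multi-word name is a run of adjacent tokens, not a single
token.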

>
> Ok now on to the 2nd problem...
>
> Even when i manually sort all the nested tags and i finally train a
> maxent model on the newly automatically annotated papers, i still get
> very poor results (poorer than the dictionary) and i think i can see
> why but that contradicts the openNLP documentation and all the
> examples i've seen so far.
>
> On the openNLP tutorial it seems perfectly normal to have 2 words
> inside a tag like:
>
> <START:name> Pierre Vinken <END>
>
> but when the time comes to use the name-finder model you just trained
> everything has to be tokenized again. Therefore, "Pierre Vinken"
> becomes 2 tokens and cannot be recognised. Sometimes just the one
> token may be recognised as an entity but without the rest of the name
> it is not only meaningless but could be misleading. Again think about
> Folic acid... Neither "folic" nor "acid" are drugs...even if the
> name-finder recognises "folic", it's of no use!!! In the same way that
> "Vinken" is not necessarily a name...
>
> Anyway sorry for the massive e-mail but i'm really struggling!
>
>
> Please help me, i'm at a dead-end at the moment! i've tried literally
> everything...am i missing anything important?
Have you tried the latest code for the namefinder?  We have fixed a few
deficiencies in the code that may be causing some of your problems
outside of the actual training.

Other than that, it really isn't a good idea to blindly train new models
on machine-annotated data.  It doesn't work that way just yet, mostly
because a human has to correct the detection errors before the training
data is valid.  Many of the data sets we use have the same limitation:
we can't train components like the POS tagger on some CoNLL data because
many of the POS tags were generated by machine and have not been
verified for accuracy.  We do use many of these data sets for the
namefinder, though, because some have been annotated by hand with the
name information needed.

> Thanks in advance...keep up the good work!
>
> JIm
>
>
>
