RE: Dictionary Lookup algorithm

Masanz, James J. Thu, 11 Apr 2013 13:04:22 -0700

This doesn't have the detail you want, but if you haven't seen it already, you 
still might want to start with the following page, and then re-read it 
after you read this post.
https://cwiki.apache.org/CTAKES/ctakes-30-dictionary-lookup.html


In particular note the mentions of the LookupDescriptorFile.

That page doesn't have details of the various classes such as 
FirstTokenPermutationImpl.

I believe we don't have anything better than the javadocs for 
FirstTokenPermutationImpl.
Pei generated a preview of the latest javadocs under staging:
http://ctakes.staging.apache.org/apidocs/3.0.0/


I can give a sketch of what I know about the lookup algorithms, using the 
example of FirstTokenPermutationImpl:
Suppose you are using AggregatePlaintextUMLSProcessor.xml in 
ctakes-clinical-pipeline.
After all the noun phrases are found, the LookupWindowAnnotator is used to 
create a LookupWindowAnnotation for each noun phrase.
Any overlapping LookupWindowAnnotations are merged (by MaxLookupWindows 
annotator).
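
To illustrate the merging step, here is a minimal Python sketch (not the cTAKES 
code, which is Java). It takes "merged" literally and assumes each window is a 
(begin, end) character span with an exclusive end; the function name 
merge_overlapping is hypothetical, and the real annotator may resolve overlaps 
differently.

```python
def merge_overlapping(spans):
    """Merge overlapping (begin, end) spans; end offsets are exclusive."""
    merged = []
    for begin, end in sorted(spans):
        if merged and begin < merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


# Two overlapping noun-phrase windows and one separate window.
print(merge_overlapping([(0, 10), (5, 15), (20, 30)]))
```

In this sketch, windows that merely touch (one ends where the next begins) are 
kept separate.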

Then for each LookupWindowAnnotation, the following is done:
 - for each token, look up the token in the 
   "first token" field of the dictionary.
 - if the token is found at least once, collect all dictionary
   entries that start with that token
 - for each dictionary entry:
    -- within the current LookupWindowAnnotation, and within
       n tokens to the right of the current token, try to 
       find all the other tokens from the dictionary entry.
       If they are all found, add the dictionary entry to 
       the list of hits.
       A token is considered "found" if either there is 
       an exact match or a match to the normalized form 
       of the word (this is configured in LookupDesc*xml)
 - Then something like NamedEntityLookupConsumerImpl 
   is used to create the actual annotation within the CAS.
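
The steps above can be sketched in a few lines of Python. This is only an 
illustration written from the description in this post, not the actual 
FirstTokenPermutationImpl code (which is Java). The names build_index, 
token_forms and lookup are hypothetical, the dictionary is assumed to be a list 
of token tuples, and lowercasing stands in for whatever normalization the real 
pipeline applies.

```python
from collections import defaultdict


def build_index(entries):
    """Index dictionary entries (tuples of tokens) by their first token."""
    index = defaultdict(list)
    for entry in entries:
        index[entry[0]].append(entry)
    return index


def token_forms(token, normalize):
    """A token counts as found via an exact match or its normalized form."""
    return {token, normalize(token)}


def lookup(window, index, max_distance, normalize=str.lower):
    """Return (position, entry) hits for the tokens of one lookup window.

    max_distance is the "n" above: how far to the right of the
    current token the remaining entry tokens may appear.
    """
    hits = []
    for i, token in enumerate(window):
        # Collect all dictionary entries that start with this token.
        candidates = [
            entry
            for form in sorted(token_forms(token, normalize))
            for entry in index.get(form, [])
        ]
        if not candidates:
            continue
        # Tokens available to the right, still inside the window.
        available = set()
        for other in window[i + 1 : i + 1 + max_distance]:
            available |= token_forms(other, normalize)
        for entry in candidates:
            if all(word in available for word in entry[1:]):
                hits.append((i, entry))
    return hits


index = build_index([("pain", "chest"), ("lung", "cancer")])
print(lookup(["Pain", "in", "the", "chest"], index, max_distance=3))
```

A consumer step would then turn each hit into an annotation. That part is left 
out here.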

Since comparisons are done one token at a time, it is important that the 
dictionary be tokenized the same way that the text is being tokenized.
Since FirstTokenPermutationImpl looks out n tokens, a dictionary entry of x 
tokens matches whenever all of its words are found within a single 
LookupWindowAnnotation, where x < n and x < length of the 
LookupWindowAnnotation. Intervening words are allowed and ignored. Word order 
is also ignored, except that the first word must be to the left of all the 
other words (since the FirstTokenPermutationImpl algorithm looks only to the 
right of the current token).
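
A small Python check makes the consequences for word order and intervening 
words concrete. It is a simplified sketch based on the description in this 
post, not the real implementation: matches_at is a hypothetical helper, only 
exact token matches are considered, and the example entry is made up.

```python
def matches_at(window, start, entry, n):
    """True if entry's first token is at window[start] and all of its
    remaining tokens occur among the next n tokens of the window."""
    if window[start] != entry[0]:
        return False
    following = set(window[start + 1 : start + 1 + n])
    return all(token in following for token in entry[1:])


entry = ("pain", "chest", "left")

# Intervening words ("in", "the") are ignored, and "left" may precede "chest".
print(matches_at(["pain", "in", "the", "left", "chest"], 0, entry, 4))  # True

# The same text fails if n is too small to reach the remaining words.
print(matches_at(["pain", "in", "the", "left", "chest"], 0, entry, 2))  # False

# The first word must be leftmost: "chest" is to the left of "pain" here.
print(matches_at(["chest", "pain", "left"], 1, entry, 4))  # False
```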

The above is mostly taken from memory. And I've glossed over a number of 
details. Hopefully this at least gives an overview.

-- James

> -----Original Message-----
> From: [email protected] [mailto:dev-
> [email protected]] On Behalf Of shady
> hussein
> Sent: Thursday, April 11, 2013 9:11 AM
> To: [email protected]
> Subject: Dictionary Lookup algorithm
> 
> Dear All,
>   Is there a documentation somewhere, about how the dictionary lookup
> method works exactly ?. Of course i can check the code of
> "DirectPassThroughImpl" and "FirstTokenPermutationImpl", but i find it
> waste of time, if there is a documentation somewhere. Also i would like to
> understand how the lookupwindow annotation works. If there is some guide
> to these things. I would be very grateful
> 
> 
> Thanks,
>       Shady
