This doesn't have the detail you want, but if you haven't seen it already, you might still want to start with the following page, and then re-read it after you read this post:
https://cwiki.apache.org/CTAKES/ctakes-30-dictionary-lookup.html
In particular, note the mentions of the LookupDescriptorFile. That page doesn't have details of the various classes such as FirstTokenPermutationImpl. I believe we don't have anything better than the javadocs for FirstTokenPermutationImpl. Pei generated a preview of the latest javadocs under staging:
http://ctakes.staging.apache.org/apidocs/3.0.0/

I can give a sketch of what I know about the lookup algorithms, using the example of FirstTokenPermutationImpl:

Suppose you are using AggregatePlaintextUMLSProcessor.xml in ctakes-clinical-pipeline. After all the noun phrases are found, the LookupWindowAnnotator is used to create a LookupWindowAnnotation for each noun phrase. Any overlapping LookupWindowAnnotations are merged (by the MaxLookupWindows annotator).

Then, for each LookupWindowAnnotation, the following is done:

- for each token, look up the token in the "first token" field of the dictionary.
- if the token is found at least once, collect all dictionary entries that start with that token.
- for each dictionary entry:
-- within the current LookupWindowAnnotation, but within n tokens to the right of the current token, try to find all the other tokens from the dictionary entry. If they are all found, add the dictionary entry to the list of hits.
   A token is considered "found" if there is either an exact match or a match to the normalized form of the word (as configured in LookupDesc*xml).
- Then something like NamedEntityLookupConsumerImpl is used to create the actual annotation within the CAS.

Since comparisons are done one token at a time, it is important that the dictionary be tokenized the same way that the text is being tokenized.

Since FirstTokenPermutationImpl looks out n tokens, if all the words in a dictionary entry of x tokens are found within a single LookupWindowAnnotation, where x < n and x < length of LookupWindowAnnotation, intervening words are allowed and ignored.
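
To make the steps above concrete, here is a minimal sketch in Java. It is not the cTAKES code and does not use the cTAKES or UIMA APIs: the class name FirstTokenLookupSketch, the methods normalize, indexByFirstToken, containsAll and lookup, and the parameter maxDistance (standing in for n) are all hypothetical names made up for illustration. Lowercasing stands in for the real normalization step, and token lists stand in for LookupWindowAnnotations and dictionary entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not cTAKES classes.
final class FirstTokenLookupSketch {

    // Assumption: lowercasing stands in for the pipeline's real normalizer.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Build the "first token" index: first token -> entries starting with it.
    static Map<String, List<List<String>>> indexByFirstToken(List<List<String>> entries) {
        Map<String, List<List<String>>> index = new HashMap<>();
        for (List<String> entry : entries) {
            index.computeIfAbsent(normalize(entry.get(0)), k -> new ArrayList<>()).add(entry);
        }
        return index;
    }

    // True if every token in 'needed' occurs in 'span', in any order.
    static boolean containsAll(List<String> span, List<String> needed) {
        List<String> remaining = new ArrayList<>();
        for (String token : span) {
            remaining.add(normalize(token));
        }
        for (String token : needed) {
            if (!remaining.remove(normalize(token))) {
                return false;
            }
        }
        return true;
    }

    // For each token in the window, collect entries that start with it and
    // whose other tokens all occur within maxDistance tokens to its right.
    static List<List<String>> lookup(List<String> window,
                                     Map<String, List<List<String>>> index,
                                     int maxDistance) {
        List<List<String>> hits = new ArrayList<>();
        for (int i = 0; i < window.size(); i++) {
            List<List<String>> candidates = index.get(normalize(window.get(i)));
            if (candidates == null) {
                continue; // no dictionary entry starts with this token
            }
            int end = Math.min(window.size(), i + 1 + maxDistance);
            List<String> right = window.subList(i + 1, end);
            for (List<String> entry : candidates) {
                if (containsAll(right, entry.subList(1, entry.size()))) {
                    hits.add(entry);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> dictionary = Arrays.asList(
                Arrays.asList("pain", "abdomen"),
                Arrays.asList("abdomen", "pain"),
                Arrays.asList("left", "abdomen"));
        List<String> window = Arrays.asList("Pain", "in", "the", "left", "lower", "abdomen");
        System.out.println(lookup(window, indexByFirstToken(dictionary), 5));
    }
}
```

In this made-up example the window "Pain in the left lower abdomen" yields the entries "pain abdomen" and "left abdomen" as hits, with the words in between skipped. The entry "abdomen pain" is not a hit, because the sketch, like the description above, only searches to the right of the entry's first token.
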
Word order is also ignored, except that the first word must be to the left of all the other words (since the FirstTokenPermutationImpl algorithm looks only to the right of the current token).

The above is mostly taken from memory, and I've glossed over a number of details. Hopefully this at least gives an overview.

-- James

> -----Original Message-----
> From: [email protected] [mailto:dev-[email protected]] On Behalf Of shady hussein
> Sent: Thursday, April 11, 2013 9:11 AM
> To: [email protected]
> Subject: Dictionary Lookup algorithm
>
> Dear All,
> Is there documentation somewhere about how the dictionary lookup
> method works exactly? Of course I can check the code of
> "DirectPassThroughImpl" and "FirstTokenPermutationImpl", but I find it a
> waste of time if there is documentation somewhere. I would also like to
> understand how the lookupwindow annotation works. If there is some guide
> to these things, I would be very grateful.
>
> Thanks,
> Shady
