[ 
https://issues.apache.org/jira/browse/CTAKES-143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Miller updated CTAKES-143:
------------------------------

    Attachment: dict_fix.svg

Plot of processing time for the baseline (blue) and the patch (magenta).  The 
x-axis is note length in characters; the y-axis is processing time in seconds.
                
> dictionary lookup iterates inefficiently
> ----------------------------------------
>
>                 Key: CTAKES-143
>                 URL: https://issues.apache.org/jira/browse/CTAKES-143
>             Project: cTAKES
>          Issue Type: Improvement
>          Components: ctakes-dictionary-lookup
>            Reporter: Tim Miller
>         Attachments: dict.diff, dict_fix.svg
>
>
> I noticed that processing time grows quadratically with note length on large 
> notes, mainly due to the dictionary annotator.  DictionaryLookupAnnotator 
> iterates over "lookup windows."  Inside this iteration, it calls 
> getLookupTokenIterator() in the LookupInitializer.  As implemented in 
> FirstTokenPermLookupInitializerImpl, this method creates an iterator over 
> all BaseTokens, pruning out certain subtypes.  The iterator is then passed 
> back to the DictionaryLookupAnnotator, which only then constrains the list 
> to those tokens which overlap the lookup window.  Iterating over every token 
> in the document for each lookup window is extremely inefficient: the work is 
> proportional to (number of windows) x (number of tokens), and both grow with 
> note length.  This could be fixed by, e.g., constraining to the window first 
> and then pruning by subtype.
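
To illustrate the "constrain first, then prune" idea from the description, 
here is a minimal sketch. It is not the code in dict.diff and does not use 
the cTAKES API: Token, firstAtOrAfter and tokensInWindow are hypothetical 
names. It assumes the tokens are sorted by begin offset and that lookup 
windows are aligned to token boundaries, so "in the window" is simplified 
to "begins inside the window."

```java
import java.util.ArrayList;
import java.util.List;

class WindowedLookupSketch {

    /** Hypothetical stand-in for a BaseToken: character offsets plus a prune flag. */
    static final class Token {
        final int begin;
        final int end;
        final boolean prunable; // true for subtypes the lookup should skip

        Token(int begin, int end, boolean prunable) {
            this.begin = begin;
            this.end = end;
            this.prunable = prunable;
        }
    }

    /** Binary search: index of the first token with begin >= offset. */
    static int firstAtOrAfter(List<Token> tokens, int offset) {
        int lo = 0;
        int hi = tokens.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (tokens.get(mid).begin < offset) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Constrain to [windowBegin, windowEnd) first, then prune by subtype. */
    static List<Token> tokensInWindow(List<Token> tokens, int windowBegin, int windowEnd) {
        List<Token> result = new ArrayList<>();
        for (int i = firstAtOrAfter(tokens, windowBegin); i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.begin >= windowEnd) {
                break; // past the window: stop instead of scanning the rest of the note
            }
            if (!t.prunable) {
                result.add(t);
            }
        }
        return result;
    }
}
```

With this shape, each window costs a binary search plus the tokens it 
actually contains, rather than a pass over every token in the note.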

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
