[
https://issues.apache.org/jira/browse/CTAKES-143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Miller updated CTAKES-143:
------------------------------
Attachment: dict_fix.svg
Plot of processing time for baseline (blue) and patch (magenta). The x-axis
is note length in characters; the y-axis is processing time in seconds.
> dictionary lookup iterates inefficiently
> ----------------------------------------
>
> Key: CTAKES-143
> URL: https://issues.apache.org/jira/browse/CTAKES-143
> Project: cTAKES
> Issue Type: Improvement
> Components: ctakes-dictionary-lookup
> Reporter: Tim Miller
> Attachments: dict.diff, dict_fix.svg
>
>
> I noticed that processing time on large notes grows quadratically with file
> length, mainly due to the dictionary annotator. DictionaryLookupAnnotator
> iterates over "lookup windows." Inside this iteration, it calls
> getLookupTokenIterator() in the LookupInitializer. As implemented in
> FirstTokenPermLookupInitializerImpl, this method creates an iterator over
> all BaseTokens, pruning out certain subtypes. The iterator is then passed
> back to the DictionaryLookupAnnotator, which only then constrains the list
> to those tokens that overlap the lookup window. Iterating over every token
> in the document for each lookup window is extremely inefficient. This could
> be fixed by, e.g., constraining to the lookup window first and then pruning
> by subtype.
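
A minimal sketch of the "constrain first, then prune" idea described above.
It is not the attached patch and does not use the cTAKES or UIMA classes:
LookupWindowSketch, Token, lookupEligible, firstIndexAtOrAfter and
tokensForWindow are hypothetical names made up for illustration. It assumes
the tokens are sorted by begin offset, do not overlap each other, and do not
straddle a window boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the cTAKES/UIMA types.
class LookupWindowSketch {

    static final class Token {
        final int begin;
        final int end;
        final boolean lookupEligible; // stands in for the sub-type pruning check

        Token(int begin, int end, boolean lookupEligible) {
            this.begin = begin;
            this.end = end;
            this.lookupEligible = lookupEligible;
        }
    }

    // Binary search: index of the first token whose begin offset is >= offset.
    // Assumes tokens are sorted by begin offset.
    static int firstIndexAtOrAfter(List<Token> tokens, int offset) {
        int lo = 0;
        int hi = tokens.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (tokens.get(mid).begin < offset) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Constrain to the window first, then prune by eligibility.
    // Only tokens inside [windowBegin, windowEnd) are visited.
    static List<Token> tokensForWindow(List<Token> tokens, int windowBegin, int windowEnd) {
        List<Token> result = new ArrayList<>();
        for (int i = firstIndexAtOrAfter(tokens, windowBegin); i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.begin >= windowEnd) {
                break; // past the window; stop instead of scanning the rest of the document
            }
            if (t.lookupEligible) {
                result.add(t);
            }
        }
        return result;
    }
}
```

With this ordering, the work per window is a binary search plus the tokens
inside that window, rather than a pass over every token in the document.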