I'm running the default pipeline on some large files and trying to fix
some of the slower annotators. I changed ChunkAdjuster to use UimaFit
selectors, which dramatically improves its speed on large files. I
removed the OverlapAnnotator, with its complicated interface and
extreme generality, from my pipeline altogether and replaced it with a
3-line static annotator. I think we should consider doing that for the
default pipeline, even if we think there are good reasons to keep the
general-purpose annotator around.
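
For concreteness, here's roughly the shape of that replacement. This is
a sketch rather than my exact code: it assumes the only behavior the
default pipeline needs from OverlapAnnotator is "drop any annotation
strictly covered by a larger annotation of the same type", and the
class name is made up:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
    import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    public class RemoveCoveredAnnotator extends JCasAnnotator_ImplBase {
      @Override
      public void process(JCas jCas) throws AnalysisEngineProcessException {
        // Collect first, remove after, so we don't modify the indexes
        // while the selectors are iterating over them.
        List<Annotation> covered = new ArrayList<>();
        // The UimaFit selectors do the index walking for us.
        for (Annotation outer : JCasUtil.select(jCas, Annotation.class)) {
          for (Annotation inner : JCasUtil.selectCovered(Annotation.class, outer)) {
            // Only drop same-type annotations with a strictly smaller span.
            if (inner.getClass() == outer.getClass()
                && (inner.getBegin() != outer.getBegin()
                    || inner.getEnd() != outer.getEnd())) {
              covered.add(inner);
            }
          }
        }
        for (Annotation a : covered) {
          a.removeFromIndexes();
        }
      }
    }

The core logic is the few lines in the middle; the rest is UimaFit
boilerplate.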
Anyway, now I'm at the dictionary lookup, which I suspect will be the
slowest component. One call is to getContextMap(), which seems
especially slow. It is called for every LookupWindow and, given the
span of that window, iterates over all LookupWindows looking for one
with an equivalent span. So in the end you give it a lookup window and
it basically gives you the same one back. Of course the code is
written very generally, so there may be use cases where the types are
different, but for the default case it seems a little weird for
something that does nothing to take so long.
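
To make the optimization I have in mind concrete: build a
span-to-window index once per CAS, then answer each lookup with a hash
get instead of a fresh scan over all the windows. Again, this is only
a sketch under my assumptions: I'm using the generic Annotation type
rather than guessing at the window type's package, and if some
configuration allows two windows with the same span the map value
would need to be a list:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;
    import org.apache.uima.jcas.tcas.Annotation;

    public final class SpanIndex {

      private final Map<Long, Annotation> windowsBySpan = new HashMap<>();

      // Built once per CAS: one pass over the windows, instead of one
      // pass per getContextMap()-style call.
      public SpanIndex(JCas jCas, Class<? extends Annotation> windowType) {
        for (Annotation w : JCasUtil.select(jCas, windowType)) {
          windowsBySpan.put(key(w.getBegin(), w.getEnd()), w);
        }
      }

      // The window with exactly this span, or null if there isn't one.
      public Annotation get(int begin, int end) {
        return windowsBySpan.get(key(begin, end));
      }

      // Pack (begin, end) into one long so it can serve as a map key.
      private static long key(int begin, int end) {
        return ((long) begin << 32) | (end & 0xffffffffL);
      }
    }

As far as I can tell that turns the per-window cost from linear in the
number of windows to constant without changing behavior in the default
case, but that's exactly the kind of assumption I'd like to check.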
So my question is: does anyone know what the engineering goals of this
setup are? I think it can be optimized even within the super-general
framework it is trying to maintain, but I don't want to break anything
by making assumptions that aren't valid.
Thanks
Tim