Some experimental work speeding up UIMA iterators over Sorted indexes (like the annotation index)

Marshall Schor Fri, 20 Mar 2015 06:51:06 -0700

I recently helped a UIMA user speed up a particular component that had n-squared
CAS instance retrieval operations on Annotation subtype T (and its subtypes
Tsub) where there were lots of subtypes.


UIMA indexes allow retrieving items from the CAS, and are supposed to trade off
space (for indexes) for time (speed of finding items in the CAS, speed of
iterating). 

Profiling showed a lot of time being spent in the iterators "heapify up and
down" methods - these are used when iterating over a type which has subtypes, to
select the "next" item in the sorted order which may be of a different type than
the current item.

I was able to speed the annotator up by about 10x (because of the n-squared
algorithm - they iterate over all instances of T, and then, for each instance,
they iterated over the "subannotations" bounded by that instance, looking for a
relationship).  The speedup was done by first extracting the instances (in
sorted order) into a plain Java array of Java-cover-type instances (they're
using the JCas, so these were JCas classes) and then writing an iterator over
that array; this iterator of course no longer had to do the heapify up and down
operations.

This got me thinking that it would not be too hard to this kind of thing
automatically, in core UIMA (extract into a flattened 1-level array) an index
for all instances of a type T (and its subtypes).  The main issue is to figure
out the right time to do this, because an update to the index for T or any of
its subtypes Tsub makes it necessary to re-compute the flattened array.  Given
that I've seen lots of cases of UIMA flows where some annotators will create and
index a type (and its subtypes), and once that's been done, the indexes are not
subsequently updated for these types, but downstream annotators iterate over
them, it seems that a lazy creation for this kind of flattened index would work
well in many cases.

I'm experimenting with this, and will probably add something like this to UIMA
in a way that should be backwards compatible, except for having iterators over
sorted indexes sometimes **no longer** be "fast fail"; this is already allowed
by the specification.  The iterator would check to see if a flattened version
existed, and if so, verify it was valid at the start, but would not be
guaranteed to notice if the index was updated (things added/ removed/ reordered)
while the iteration was in progress.

I'm planning to trigger the creation of this flattened index when the number of
heapify up and down calls > index size (with the counters reset if the indexes
were updated); the creation will be done in an executor on a separate thread. 

At a high level, this would be a bit like JIT for this part of UIMA - improving
"indexes" operation.

-Marshall

Some experimental work speeding up UIMA iterators over Sorted indexes (like the annotation index)

Reply via email to