I recently helped a UIMA user speed up a particular component that had n-squared CAS instance retrieval operations on Annotation subtype T (and its subtypes Tsub) where there were lots of subtypes.
UIMA indexes allow retrieving items from the CAS, and are supposed to trade off space (for indexes) for time (speed of finding items in the CAS, speed of iterating). Profiling showed a lot of time being spent in the iterators "heapify up and down" methods - these are used when iterating over a type which has subtypes, to select the "next" item in the sorted order which may be of a different type than the current item. I was able to speed the annotator up by about 10x (because of the n-squared algorithm - they iterate over all instances of T, and then, for each instance, they iterated over the "subannotations" bounded by that instance, looking for a relationship). The speedup was done by first extracting the instances (in sorted order) into a plain Java array of Java-cover-type instances (they're using the JCas, so these were JCas classes) and then writing an iterator over that array; this iterator of course no longer had to do the heapify up and down operations. This got me thinking that it would not be too hard to this kind of thing automatically, in core UIMA (extract into a flattened 1-level array) an index for all instances of a type T (and its subtypes). The main issue is to figure out the right time to do this, because an update to the index for T or any of its subtypes Tsub makes it necessary to re-compute the flattened array. Given that I've seen lots of cases of UIMA flows where some annotators will create and index a type (and its subtypes), and once that's been done, the indexes are not subsequently updated for these types, but downstream annotators iterate over them, it seems that a lazy creation for this kind of flattened index would work well in many cases. I'm experimenting with this, and will probably add something like this to UIMA in a way that should be backwards compatible, except for having iterators over sorted indexes sometimes **no longer** be "fast fail"; this is already allowed by the specification. The iterator would check to see if a flattened version existed, and if so, verify it was valid at the start, but would not be guaranteed to notice if the index was updated (things added/ removed/ reordered) while the iteration was in progress. I'm planning to trigger the creation of this flattened index when the number of heapify up and down calls > index size (with the counters reset if the indexes were updated); the creation will be done in an executor on a separate thread. At a high level, this would be a bit like JIT for this part of UIMA - improving "indexes" operation. -Marshall
