I haven't finished the work yet, but soon :-). -Marshall
On 3/20/2015 11:20 AM, Peter Klügl wrote: > Very interesting. From the ruta perspective, one can see ruta rules as some > kind of cascaded FSTs / atomic AEs, which iterate over specific annotations > that have been created previously but are hardly updated/added in later > phases. I am already very curious how your changes influence the speed of the > ruta rule execution. > > Best, > > Peter > > Am 20.03.2015 um 14:50 schrieb Marshall Schor: >> I recently helped a UIMA user speed up a particular component that had >> n-squared >> CAS instance retrieval operations on Annotation subtype T (and its subtypes >> Tsub) where there were lots of subtypes. >> >> UIMA indexes allow retrieving items from the CAS, and are supposed to trade >> off >> space (for indexes) for time (speed of finding items in the CAS, speed of >> iterating). >> >> Profiling showed a lot of time being spent in the iterators "heapify up and >> down" methods - these are used when iterating over a type which has >> subtypes, to >> select the "next" item in the sorted order which may be of a different type >> than >> the current item. >> >> I was able to speed the annotator up by about 10x (because of the n-squared >> algorithm - they iterate over all instances of T, and then, for each >> instance, >> they iterated over the "subannotations" bounded by that instance, looking >> for a >> relationship). The speedup was done by first extracting the instances (in >> sorted order) into a plain Java array of Java-cover-type instances (they're >> using the JCas, so these were JCas classes) and then writing an iterator over >> that array; this iterator of course no longer had to do the heapify up and >> down >> operations. >> >> This got me thinking that it would not be too hard to this kind of thing >> automatically, in core UIMA (extract into a flattened 1-level array) an index >> for all instances of a type T (and its subtypes). The main issue is to >> figure >> out the right time to do this, because an update to the index for T or any of >> its subtypes Tsub makes it necessary to re-compute the flattened array. >> Given >> that I've seen lots of cases of UIMA flows where some annotators will create >> and >> index a type (and its subtypes), and once that's been done, the indexes are >> not >> subsequently updated for these types, but downstream annotators iterate over >> them, it seems that a lazy creation for this kind of flattened index would >> work >> well in many cases. >> >> I'm experimenting with this, and will probably add something like this to >> UIMA >> in a way that should be backwards compatible, except for having iterators >> over >> sorted indexes sometimes **no longer** be "fast fail"; this is already >> allowed >> by the specification. The iterator would check to see if a flattened version >> existed, and if so, verify it was valid at the start, but would not be >> guaranteed to notice if the index was updated (things added/ removed/ >> reordered) >> while the iteration was in progress. >> >> I'm planning to trigger the creation of this flattened index when the number >> of >> heapify up and down calls > index size (with the counters reset if the >> indexes >> were updated); the creation will be done in an executor on a separate thread. >> >> At a high level, this would be a bit like JIT for this part of UIMA - >> improving >> "indexes" operation. >> >> -Marshall >> > > >
