I haven't finished the work yet, but soon :-).

-Marshall

On 3/20/2015 11:20 AM, Peter Klügl wrote:
> Very interesting. From the ruta perspective, one can see ruta rules as some
> kind of cascaded FSTs / atomic AEs, which iterate over specific annotations
> that have been created previously but are hardly updated/added in later
> phases. I am already very curious how your changes influence the speed of the
> ruta rule execution.
>
> Best,
>
> Peter
>
> Am 20.03.2015 um 14:50 schrieb Marshall Schor:
>> I recently helped a UIMA user speed up a particular component that had 
>> n-squared
>> CAS instance retrieval operations on Annotation subtype T (and its subtypes
>> Tsub) where there were lots of subtypes.
>>
>> UIMA indexes allow retrieving items from the CAS, and are supposed to trade 
>> off
>> space (for indexes) for time (speed of finding items in the CAS, speed of
>> iterating).
>>
>> Profiling showed a lot of time being spent in the iterators "heapify up and
>> down" methods - these are used when iterating over a type which has 
>> subtypes, to
>> select the "next" item in the sorted order which may be of a different type 
>> than
>> the current item.
>>
>> I was able to speed the annotator up by about 10x (because of the n-squared
>> algorithm - they iterate over all instances of T, and then, for each 
>> instance,
>> they iterated over the "subannotations" bounded by that instance, looking 
>> for a
>> relationship).  The speedup was done by first extracting the instances (in
>> sorted order) into a plain Java array of Java-cover-type instances (they're
>> using the JCas, so these were JCas classes) and then writing an iterator over
>> that array; this iterator of course no longer had to do the heapify up and 
>> down
>> operations.
>>
>> This got me thinking that it would not be too hard to this kind of thing
>> automatically, in core UIMA (extract into a flattened 1-level array) an index
>> for all instances of a type T (and its subtypes).  The main issue is to 
>> figure
>> out the right time to do this, because an update to the index for T or any of
>> its subtypes Tsub makes it necessary to re-compute the flattened array.  
>> Given
>> that I've seen lots of cases of UIMA flows where some annotators will create 
>> and
>> index a type (and its subtypes), and once that's been done, the indexes are 
>> not
>> subsequently updated for these types, but downstream annotators iterate over
>> them, it seems that a lazy creation for this kind of flattened index would 
>> work
>> well in many cases.
>>
>> I'm experimenting with this, and will probably add something like this to 
>> UIMA
>> in a way that should be backwards compatible, except for having iterators 
>> over
>> sorted indexes sometimes **no longer** be "fast fail"; this is already 
>> allowed
>> by the specification.  The iterator would check to see if a flattened version
>> existed, and if so, verify it was valid at the start, but would not be
>> guaranteed to notice if the index was updated (things added/ removed/ 
>> reordered)
>> while the iteration was in progress.
>>
>> I'm planning to trigger the creation of this flattened index when the number 
>> of
>> heapify up and down calls > index size (with the counters reset if the 
>> indexes
>> were updated); the creation will be done in an executor on a separate thread.
>>
>> At a high level, this would be a bit like JIT for this part of UIMA - 
>> improving
>> "indexes" operation.
>>
>> -Marshall
>>   
>
>
>

Reply via email to