Re: inconsistency in implementation of SubIterators

Marshall Schor Sat, 11 Apr 2015 09:25:02 -0700

Re: confusion regarding subiterators and "type priorities". I agree that many
users have wanted a simpler version of a subiterator that just bounds the
iterator to a specific begin and end, without any reference to type priorities.


To do this, I'm thinking of an API which expresses this quite directly.  It
would be nice to be able to continue to offer the "strict" and "unambiguous"
styles as well, for "architectural generality" (although I don't really know if
there's a perceived need for this; architectural generality has the side benefit
that users have fewer unique special things to "learn"; instead they learn some
base things which they can then combine).

I know some languages (Python?) use the term "slice" to express a subsequence
from an ordered collection.

So perhaps we can have for annotation indexes an additional method:
  slice(int begin, int end)  or
  slice(FeatureStructure fs) [ where the fs just supplies a begin / end ]

which would return a lightweight wrapper of the specific index to operate as a
subiterator.

We could also make the strict and unambiguous work this way too
  strict()
  unambiguous()

This would then permit the use of this now-specialized index in Java's
for (xxx : yyy) loop style.

Note that this version would not "skip" anything; if users wanted to skip some
particular items, they'd need to do that in the loop.
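
As a rough illustration of the idea, here is a sketch in plain Java with no UIMA
classes. The names Span, SliceView and SliceDemo are made up for this example
and are not existing API. It also assumes that "in bounds" means the span lies
entirely inside the slice (begin >= slice begin and end <= slice end); other
definitions are possible.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an annotation; only begin/end matter here.
final class Span {
    final int begin;
    final int end;
    final String label;

    Span(int begin, int end, String label) {
        this.begin = begin;
        this.end = end;
        this.label = label;
    }
}

// Hypothetical wrapper: a read-only view of an index, bounded by begin/end.
final class SliceView implements Iterable<Span> {
    private final List<Span> index;
    private final int begin;
    private final int end;

    SliceView(List<Span> index, int begin, int end) {
        this.index = index;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public Iterator<Span> iterator() {
        // Collect every span lying entirely inside [begin, end]; nothing is
        // skipped, and type priorities play no role.
        List<Span> inBounds = new ArrayList<>();
        for (Span s : index) {
            if (s.begin >= begin && s.end <= end) {
                inBounds.add(s);
            }
        }
        return inBounds.iterator();
    }
}

final class SliceDemo {
    public static void main(String[] args) {
        List<Span> index = new ArrayList<>();
        index.add(new Span(0, 5, "before"));
        index.add(new Span(10, 100, "same-span"));
        index.add(new Span(12, 20, "inside"));
        index.add(new Span(90, 120, "overlaps-end"));

        for (Span s : new SliceView(index, 10, 100)) {
            // Any extra skipping is the caller's job, done here in the loop.
            if (s.label.equals("same-span")) {
                continue;
            }
            System.out.println(s.label + " [" + s.begin + ", " + s.end + ")");
        }
    }
}
```

Because the view implements Iterable, it drops straight into a for-each loop.
This sketch copies the matching items into a list on each call to iterator();
whether a real implementation would do that or iterate lazily is left open.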

The implementation of this could make use of additional index structures
(lists), like the current base UIMA implementations for strict and unambiguous
do.  I'd like to keep this "detail" out of the API if possible, hoping that the
decision to do this (or not) could be somehow automatic :-) .

-Marshall

On 4/10/2015 2:23 AM, Richard Eckart de Castilho wrote:
> On 09.04.2015, at 23:42, Marshall Schor <[email protected]> wrote:
>
>> In UIMA, Subiterators are defined relative to an existing Annotation Index
>> over some subtype of Annotation.
>>
>> When you create a subiterator, you pass in boundaries (begin / end) used to
>> restrict the iterator to those instances within that span.
>>
>> The boundaries are passed in using a FeatureStructure, which may be a new
>> one, or an existing one (perhaps also in the Annotation Index, but it need
>> not be).
>>
>> When these were defined, the concept of having multiple "equal" instances
>> (in the sense that the defined keys - begin, end, and type priority order -
>> matched between two FeatureStructures) was not thought of, I think.  The
>> implementation currently includes code that, when creating the iterator,
>> does a "moveTo(the_bounding_fs)" operation, and then, if it finds that the
>> FS at that spot is "equal" to the bounding FS, it moves-to-next to "skip" it.
>>
>> Extending this to the possibility of "multiple" equal FSs, the effect is
>> currently to skip just the first (of possibly many "equal" instances).
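
To illustrate the behavior just described, here is a minimal sketch in plain
Java. It is not the actual UIMA code: Anno and SkipFirstDemo are made-up names,
and the ordering used (begin ascending, then end descending) leaves out type
priorities for brevity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an annotation; id only tells instances apart.
final class Anno {
    final int begin;
    final int end;
    final String id;

    Anno(int begin, int end, String id) {
        this.begin = begin;
        this.end = end;
        this.id = id;
    }
}

final class SkipFirstDemo {
    // Simplified index order: begin ascending, then end descending.
    static final Comparator<Anno> ORDER = (a, b) ->
            a.begin != b.begin
                    ? Integer.compare(a.begin, b.begin)
                    : Integer.compare(b.end, a.end);

    // Position where iteration would start for the given bounding item.
    static int startPosition(List<Anno> sorted, Anno bound) {
        int pos = 0;
        // "moveTo": advance to the first item that is not less than the bound.
        while (pos < sorted.size() && ORDER.compare(sorted.get(pos), bound) < 0) {
            pos++;
        }
        // If that item compares "equal" to the bound, skip it, but only it.
        if (pos < sorted.size() && ORDER.compare(sorted.get(pos), bound) == 0) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        List<Anno> sorted = new ArrayList<>();
        sorted.add(new Anno(0, 5, "a"));
        sorted.add(new Anno(10, 100, "x1"));
        sorted.add(new Anno(10, 100, "x2"));
        sorted.add(new Anno(12, 20, "c"));

        // The bounding item is never added to the list.
        Anno bound = new Anno(10, 100, "bound");
        int start = startPosition(sorted, bound);
        System.out.println("iteration starts at: " + sorted.get(start).id);
    }
}
```

With two items "equal" to the bound, the start position lands on the second of
them, so it stays in the iteration while the first is skipped.
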
> In my experience, users have often been confused by that. They thought that
> begin/end was sufficient and that type priorities were not even needed.
>
> This confusion gave rise to the uimaFIT selectCovered(jcas, begin, end) method
> which only takes offsets into account, ignores type priorities, and rewinds
> to the first of possibly multiple annotations with equal begin after the
> initial moveTo operation.
>
>> The documentation (which is in the Javadocs, mostly, for AnnotationIndex,
>> here http://uima.apache.org/d/uimaj-2.7.0/apidocs/index.html ), doesn't cover
>> this case.  It also seems to believe that the annotation supplying the
>> bounding information needs to be in the index, whereas, the implementation
>> doesn't require that.  For instance, one could decide to get all annotations
>> between 10 and 100, and just make an instance of a subtype of Annotation,
>> setting the begin/end values to 10/100, and ** never add this to the
>> indexes **, and pass it to the subiterator method as the bounding annotation.
> That is also something I often saw. The problem is that - to my
> understanding - creating such a temporary annotation consumes space in the
> CAS heaps even if the annotation is never indexed.
>
>> I realize this is an edge case, that might not be too interesting, but I'd
>> like to do some kind of better implementation to cover this.  The choices
>> seem to be to a) continue skipping the 1st one, and leave the others in the
>> iteration, or b) continue skipping the 1st one, and skip all of the other
>> "equal" ones as well.
>>
>> Another edge case happens if the bounding annotation *is* in the index.  In
>> that case the definition in the Javadocs specifies the iterator will return
>> annotations *following* the particular bounding annotation that is in the
>> index.  To implement this correctly, the code would need to search all
>> "equal" items in the index to find the one that is "EQ" / == / has the same
>> exact FeatureStructure "id", and return items "following" that in the index.
>>
>> This code is not present in the current implementation; should it be added?
>> Or should we update the Javadocs?
> If compatibility is an issue, I'd be for updating the JavaDoc to reflect the
> current behavior more clearly, then think about adding a new API that supports
> other kinds of behaviors, e.g. in the way that uimaFIT selectCovered is
> handling this.
>
> I think it would be great to investigate the possibilities that the Java 8
> stream API might open up. Years back, we had been contemplating in uimaFIT an
> alternative CAS "selection" API, thinking in directions akin to that stream
> API or to the Hibernate Criteria API. It might be interesting to refer to our
> notes from back then. Steven even did some initial coding:
>
> https://code.google.com/p/uimafit/issues/detail?id=65&colspec=ID%20Type%20Status%20Priority%20Milestone%20Compatible%20ASFJira%20Owner%20Summary
>
> Cheers,
>
> -- Richard
