[ 
https://issues.apache.org/jira/browse/UIMA-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836977#comment-13836977
 ] 

Marshall Schor commented on UIMA-3399:
--------------------------------------

After some discussion on the mailing list, a basic question remains: is there 
any use for adding the same FS to the indexes multiple times?

UIMA explicitly offers "Bag" and "Set" indexes. The Set notion is slightly 
complicated: it has the normal set semantics, in that a FS won't be added to a 
Set index if another FS (or the identical FS) that "matches" it (according to 
that Set index's particular keys) is already indexed.
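As a rough illustration of the "matches on keys" rule, here is a toy model in plain Java. It is only a sketch: the classes Fs and SetIndex and the choice of "begin" as the single key are hypothetical stand-ins, not the UIMA API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: these classes are hypothetical and are not the UIMA API.
class SetIndexSketch {
    // Stand-in for a feature structure with two integer features.
    static final class Fs {
        final int begin;
        final int end;
        Fs(int begin, int end) { this.begin = begin; this.end = end; }
    }

    // A "set index": an add is skipped when an entry that compares equal
    // on the index keys is already present.
    static final class SetIndex {
        private final Comparator<Fs> keys;
        private final List<Fs> entries = new ArrayList<>();

        SetIndex(Comparator<Fs> keys) { this.keys = keys; }

        boolean add(Fs fs) {
            for (Fs e : entries) {
                if (keys.compare(e, fs) == 0) {
                    return false; // a matching FS is already indexed
                }
            }
            entries.add(fs);
            return true;
        }

        int size() { return entries.size(); }
    }

    // Builds an index keyed on "begin" only and returns its final size.
    static int demo() {
        SetIndex index = new SetIndex(Comparator.comparingInt((Fs f) -> f.begin));
        Fs a = new Fs(0, 5);
        Fs b = new Fs(0, 9);       // different object, same key value
        index.add(a);              // added
        index.add(a);              // identical FS: skipped
        index.add(b);              // distinct FS with a matching key: skipped
        index.add(new Fs(3, 4));   // different key: added
        return index.size();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model a distinct FS can be kept out of the index purely because its key values equal those of an entry already there, which is the part that goes beyond ordinary object-identity set semantics.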

The Bag index is there to allow the same FS to be indexed multiple times.  Is 
this feature actually used, or useful? Here's one possible use case: suppose 
you're doing some kind of "sampling" to feed a machine-learning algorithm, and 
need to "oversample" some particular type - you could define a bag index and 
do multiple inserts of the same FS into that bag, for use by a downstream 
annotator.
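The oversampling idea could look roughly like the following toy model in plain Java. This is a sketch under assumptions: Fs, BagIndex, oversample and count are hypothetical names invented for illustration and do not correspond to UIMA classes or methods.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are hypothetical and are not the UIMA API.
class BagOversampleSketch {
    // Stand-in for a feature structure carrying a class label.
    static final class Fs {
        final String label;
        Fs(String label) { this.label = label; }
    }

    // A "bag index": every add is kept, including repeats of the same object.
    static final class BagIndex {
        private final List<Fs> entries = new ArrayList<>();
        void add(Fs fs) { entries.add(fs); }
        List<Fs> all() { return entries; }
    }

    // Inserts each sample once, except samples with rareLabel,
    // which are inserted `copies` times (the same object each time).
    static BagIndex oversample(List<Fs> samples, String rareLabel, int copies) {
        BagIndex bag = new BagIndex();
        for (Fs fs : samples) {
            int n = fs.label.equals(rareLabel) ? copies : 1;
            for (int i = 0; i < n; i++) {
                bag.add(fs);
            }
        }
        return bag;
    }

    // What a downstream consumer would see when iterating the bag.
    static long count(BagIndex bag, String label) {
        return bag.all().stream().filter(f -> f.label.equals(label)).count();
    }
}
```

A downstream consumer that simply iterates the bag then sees the rare sample `copies` times without needing to know that any oversampling took place.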



> More consistent handling of multiple add-to-index behavior for same Feature 
> Structure
> -------------------------------------------------------------------------------------
>
>                 Key: UIMA-3399
>                 URL: https://issues.apache.org/jira/browse/UIMA-3399
>             Project: UIMA
>          Issue Type: Brainstorming
>    Affects Versions: 2.4.2SDK
>            Reporter: Marshall Schor
>            Assignee: Marshall Schor
>            Priority: Minor
>
> UIMA has a somewhat unusual indexing architecture.  You can define indexes 
> (sorted, bag, set), and then add / remove a feature structure (FS) to all of 
> the defined indexes.
> The design intention (I think) was to support the concept of a FS being 
> indexed, or not.  However, the current design allows some anomalies that 
> behave inconsistently between code being run "locally", versus as remote 
> services (due to how serialization handles this).  Serialization encodes only 
> the concept of a FS being either in an index or not. 
> The problem arises in the edge case where the same FS is added to the indexes 
> multiple times.  For local (non-remote) cases, for bag and sorted indexes, 
> the same exact FS would be added multiple times.  This would have the 
> consequences:
> -  Iterating would return multiple == FSs.
> -  Remove from indexes of a multiply-added FS would reduce the number by 1; 
> the FS would still be in the index.
> For the same code, running remotely, serialization would have "collapsed" the 
> multiple additions into one, so would behave differently.
> A proposed improvement:  Change the behavior of "add-to-index" so that  
> subsequent add-to-indexes of a same FS would be either a no-op, or a delete / 
> re-add (to cover the case where some feature values of the FS might have 
> changed, and therefore leading to the need to re-index the FS).  To cover 
> users who might be exploiting the old behavior, we could have a framework 
> context flag to re-instate the older behavior.
> This would better align how code running locally or remotely works.
> What do people think about this idea?



--
This message was sent by Atlassian JIRA
(v6.1#6144)
