[jira] [Commented] (UIMA-4049) The curious case of the zombie annotation

Marshall Schor (JIRA) Sun, 12 Oct 2014 13:29:46 -0700

    [ 
https://issues.apache.org/jira/browse/UIMA-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168778#comment-14168778
 ]


Marshall Schor commented on UIMA-4049:
--------------------------------------

After taking another look, I see the following.  

The Annotation index is a sorted index using 3 "keys": the begin, the end 
(features), and the type priorities (not used here).  This sorted index is not 
a "set", in that many different elements may have the same key values.

The "remove" operation is defined to take the given FS, use its feature 
values as the "keys", and do a find operation for that FS in the index.  
See FSIntArrayIndex line 410.  

What happens in this case is:
  1) token with id 21 (Friedrich) is added to the index (begin 12, end 21)
  2) token with id 25 (II.) is added to the index (begin 22, end 25)
  3) token with id 25 is modified; its begin is changed to 12 (while it is 
indexed).
  4) The remove operation attempts to find the item to be removed.  Because it 
is looking in a sorted index, it does a binary search for the token whose begin 
is 12 and end is 21 (the token with id 21).
       - The 2nd probe of the binary search hits token 21 (Friedrich).  It 
should find that the token being searched for is greater than token 21.  
However, the token being searched for (token id 25) was modified; its begin is 
now equal to that of token 21, and its end feature is greater than that of 
token 21.  So the compare incorrectly concludes that the token being searched 
for is earlier in the list.

       - If token 25 had not had its begin value updated, the compare would 
have found that the test token was later in the list (because its begin value 
was higher).

So, bottom line, the find operation fails to find the item to be removed, and 
the remove fails. 
The reindex results in adding the modified token with id 25 into the Annotation 
index again, so it appears twice. 
 
The index looks like this (I'm typing from the Eclipse debugger, looking at 
the CAS, with "show logical structure" turned on):
{code} 
Contents of the Index:
 [0] DocumentAnnotation
 [1] Token Dies
 [2] Token flosse
 [3] Token Friedrich II.     // the reindexed item
 [4] Token Friedrich         // the failed-to-be-removed item
 [5] Token Friedrich II.     // the original indexed II. token with its begin modified
 ...
{code}
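
To make the mechanism concrete outside of UIMA, here is a minimal sketch that 
reproduces the same pattern with a toy sorted list.  Span, SortedIndex and the 
ordering (begin ascending, then end descending) are assumptions made for this 
illustration; this is not the FSIntArrayIndex code, and its probe order will 
differ from the one described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy model only: Span and SortedIndex are hypothetical stand-ins, not UIMA classes.
class StaleKeyDemo {
    /** A mutable item whose begin/end values act as the sort keys. */
    static final class Span {
        int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }

        @Override
        public String toString() {
            return label;
        }
    }

    /** Assumed ordering: begin ascending, then end descending. */
    static final Comparator<Span> ORDER = (a, b) ->
            a.begin != b.begin ? Integer.compare(a.begin, b.begin)
                               : Integer.compare(b.end, a.end);

    /** Sorted list that locates items by binary search on their current key values. */
    static final class SortedIndex {
        final List<Span> items = new ArrayList<>();

        void add(Span s) {
            int pos = Collections.binarySearch(items, s, ORDER);
            items.add(pos >= 0 ? pos : -(pos + 1), s);
        }

        /** Simplified: does not scan neighbours that share the same key values. */
        boolean remove(Span s) {
            int pos = Collections.binarySearch(items, s, ORDER);
            if (pos < 0 || items.get(pos) != s) {
                return false;
            }
            items.remove(pos);
            return true;
        }
    }

    static String run() {
        SortedIndex index = new SortedIndex();
        Span suffix = new Span(22, 25, "II.");
        index.add(new Span(0, 4, "Dies"));
        index.add(new Span(5, 11, "floesste"));
        index.add(new Span(12, 21, "Friedrich"));
        index.add(suffix);

        suffix.begin = 12;                       // key changed while still indexed
        boolean removed = index.remove(suffix);  // search is misdirected and misses
        index.add(suffix);                       // "reindex" inserts a second entry
        return "removed=" + removed + " " + index.items;
    }

    public static void main(String[] args) {
        // Prints: removed=false [Dies, floesste, II., Friedrich, II.]
        System.out.println(run());
    }
}
```

In this sketch the search for the modified item lands on "Friedrich", decides 
the item must sit to its left, and never reaches the stale slot on the right.  
The re-add then produces the same "II., Friedrich, II." sequence as in the 
listing above.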

So the correct rule for modifying any key value of something that has been 
added to the index is:

1) remove it from the indexes (before you modify it)
2) do the modifications
3) add it back to the indexes (assuming you want it to be indexed again).

I modified the loop which changes the begin values so that it no longer 
updates the token at that point, but just adds the needed changes to a list.  
Then, later, outside of the Annotation iterator, I had the code go through 
that list and apply the modifications to the token begin values using the 
remove - modify - add back to indexes approach.  This worked in either order.
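
As a rough sketch of that two-pass structure (not the actual change to the 
test case), the following uses a toy sorted list instead of a CAS.  Span, the 
add/remove helpers and the ordering are assumptions for illustration only; in 
UIMA code the corresponding calls would be removeFsFromIndexes, setIntValue 
and addFsToIndexes as used in the test case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Toy model only: Span and the helper methods are hypothetical, not UIMA classes.
class DeferredUpdateDemo {
    static final class Span {
        int begin;
        final int end;
        final String label;

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }

        @Override
        public String toString() {
            return label + "(" + begin + "," + end + ")";
        }
    }

    /** Assumed ordering: begin ascending, then end descending. */
    static final Comparator<Span> ORDER = (a, b) ->
            a.begin != b.begin ? Integer.compare(a.begin, b.begin)
                               : Integer.compare(b.end, a.end);

    static void add(List<Span> index, Span s) {
        int pos = Collections.binarySearch(index, s, ORDER);
        index.add(pos >= 0 ? pos : -(pos + 1), s);
    }

    /** Simplified: does not scan neighbours that share the same key values. */
    static boolean remove(List<Span> index, Span s) {
        int pos = Collections.binarySearch(index, s, ORDER);
        if (pos < 0 || index.get(pos) != s) {
            return false;
        }
        index.remove(pos);
        return true;
    }

    static String run() {
        List<Span> index = new ArrayList<>();
        add(index, new Span(0, 4, "Dies"));
        add(index, new Span(5, 11, "floesste"));
        add(index, new Span(12, 21, "Friedrich"));
        add(index, new Span(22, 25, "II."));

        // Pass 1: while iterating, only record what has to change.
        List<Span> toExpand = new ArrayList<>();
        List<Integer> newBegins = new ArrayList<>();
        List<Span> toDelete = new ArrayList<>();
        Span previous = null;
        for (Span token : index) {
            if (previous != null && previous.label.equals("Friedrich")
                    && token.label.equals("II.")) {
                toExpand.add(token);
                newBegins.add(previous.begin);
                toDelete.add(previous);
            }
            previous = token;
        }

        // Pass 2: outside the iteration, remove - modify - add back.
        for (int i = 0; i < toExpand.size(); i++) {
            Span token = toExpand.get(i);
            boolean wasIndexed = remove(index, token); // keys still match the index
            token.begin = newBegins.get(i);
            if (wasIndexed) {
                add(index, token);
            }
        }
        for (Span s : toDelete) {
            remove(index, s);
        }
        return index.toString();
    }

    public static void main(String[] args) {
        // Prints: [Dies(0,4), floesste(5,11), II.(12,25)]
        System.out.println(run());
    }
}
```

Because every key change happens while the item is out of the list, each 
binary search sees a consistently ordered list, and the "Friedrich" entry is 
found and removed.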

I see that this method, its restrictions, etc., are documented here: 
http://uima.apache.org/downloads/releaseDocs/2.2.2-incubating/docs/html/references/references.html#ugr.ref.jcas.adding_removing_instances_to_indexes

I do agree with you that it would be good to have some kind of automated 
checking of this; if someone can suggest a way that has minimal impact on 
correctly done code, it would be great to hear about it.
   

> The curious case of the zombie annotation
> -----------------------------------------
>
>                 Key: UIMA-4049
>                 URL: https://issues.apache.org/jira/browse/UIMA-4049
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>            Reporter: Richard Eckart de Castilho
>            Assignee: Marshall Schor
>         Attachments: CuriousTestCase.java
>
>
> When annotations are removed from indexes, sometimes they come back... the 
> following test case shows how an annotation is removed but still present when 
> iterating over the index later.
> {code}
>     @Test
>     public void testForZombies() throws Exception
>     {
>         // No zombie here
>         int[] offsets1 = { 0, 4, 5, 11, 12, 21, 22, 25, 26, 29, 30, 35, 36, 40, 41, 50, 51, 60,
>                 61, 64, 64, 65 };
>         testForZombies("Dies flößte Friedrich II. für seine neue Eroberung Besorgnis ein.",
>                 offsets1);
>         
>         // Zombie hiding in here
>         int[] offsets2 = { 0, 3, 4, 7, 8, 13, 14, 18, 19, 22, 23, 33, 34, 35 };
>         testForZombies("Ich bin Franz III. von Hammerfels !", offsets2);
>     }
>     public void testForZombies(String aText, int[] aOffsets) throws Exception
>     {
>         // Init some dictionaries we use
>         Set<String> names = new HashSet<String>();
>         names.add("Friedrich");
>         names.add("Franz");
>         Set<String> suffix = new HashSet<String>();
>         suffix.add("II.");
>         suffix.add("III.");
>         // Set up type system
>         TypeSystemDescription tsd = new TypeSystemDescription_impl();
>         tsd.addType("Token", "", CAS.TYPE_NAME_ANNOTATION);
>         
>         // Create CAS
>         CAS jcas = CasCreationUtils.createCas(tsd, null, null);
>         jcas.setDocumentText(aText);
>         
>         Type tokenType = jcas.getTypeSystem().getType("Token");
>         Feature beginFeature = tokenType.getFeatureByBaseName("begin");
>         
>         // Create tokens in CAS
>         for (int i = 0; i < aOffsets.length; i += 2) {
>             jcas.addFsToIndexes(
>                     jcas.createAnnotation(tokenType, aOffsets[i], aOffsets[i+1]));
>         }
>         
>         // List the tokens in the CAS
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             System.out.printf("Starting with %s%n", token.getCoveredText());
>         }
>         // Merge some tokens, in particular "Franz" "III." -> "Franz III."
>         // and "Friedrich" "II." into "Friedrich II."
>         AnnotationFS previous = null;
>         List<AnnotationFS> toDelete = new ArrayList<>();
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             if (previous != null && names.contains(previous.getCoveredText())
>                     && suffix.contains(token.getCoveredText())) {
>                 token.setIntValue(beginFeature, previous.getBegin());
>                 toDelete.add(previous);
>             }
>             previous = token;
>         }
>         // Remove the no longer necessary tokens ("Friedrich" and "Franz"),
>         // since we expanded the following tokens "III." and "II." to include their text
>         Set<String> removedWords = new HashSet<String>();
>         for (AnnotationFS token : toDelete) {
>             System.out.printf("Removing %s%n", token.getCoveredText());
>             removedWords.add(token.getCoveredText());
>             jcas.removeFsFromIndexes(token);
>         }
>         // Check if the tokens that we wanted to remove are really gone
>         for (AnnotationFS token : jcas.getAnnotationIndex(tokenType)) {
>             System.out.printf("Remaining %s%n", token.getCoveredText());
>             if (removedWords.contains(token.getCoveredText())) {
>                org.junit.Assert.fail("I saw a zombie!!!");
>             }
>         }
>     }
> {code}



