[jira] [Commented] (UIMA-3141) Binary CAS format 6 + type filtering fails to deserialize document annotation correctly

Marshall Schor (JIRA) Sun, 04 Aug 2013 14:36:15 -0700

    [ 
https://issues.apache.org/jira/browse/UIMA-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729006#comment-13729006
 ]


Marshall Schor commented on UIMA-3141:
--------------------------------------

I took a look at this, and it may be working as designed.

Here's what appears to be happening (I haven't run the test case yet, just 
examined the code):

1) A CAS, sourceCas, is created, having a type system which includes a special 
type definition, DocMeta, which is a subtype of the built-in 
uima.tcas.DocumentAnnotation type.  

1a) The code makes an instance of this type, and adds it to the indexes.

2) The sourceCas's "setDocumentLanguage" method is called. This method 
looks up whether there is an indexed instance of the document annotation 
type, and finds the instance of the "DocMeta" type created in 1a); it then 
sets that instance's language feature to "latin".

3) The new form 6 serializer serializes out the sourceCas, using its type 
system, so all "indexed" and reachable feature structures are serialized.

4) Now, the interesting part.  This file is deserialized into the targetCas.  
However, that CAS has been defined without the special type DocMeta.  With form 
6, this type mismatch is allowed, and the semantics are that the 
deserialization process "filters" the feature structures being deserialized, so 
that only those with type definitions in the receiving CAS are deserialized, 
and the others are "skipped".

So - this results in the DocMeta feature structure instance being skipped.

I think this is why the getDocumentLanguage call doesn't get the language set 
in the DocMeta feature structure.

If you put the DocMeta type definition into the targetCas's type system 
description, does it change the behavior so that getDocumentLanguage 
returns "latin"?
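
As a rough illustration of the filtering described in steps 3) and 4), here 
is a toy model in plain Java. It is not UIMA code and does not call the UIMA 
API: the class TypeFilterSketch, the Fs class, and all method names are 
hypothetical, and the fallback to "x-unspecified" when no document annotation 
survives is an assumption taken from the behavior reported in the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of form 6 type filtering. Hypothetical names; not UIMA code. */
public class TypeFilterSketch {

    public static final String DOC_ANNOTATION = "uima.tcas.DocumentAnnotation";
    public static final String DOC_META = "DocMeta";
    // Assumption: value seen when no language-carrying annotation is found.
    public static final String DEFAULT_LANGUAGE = "x-unspecified";

    /** A feature structure reduced to a type name and a language feature. */
    public static final class Fs {
        final String type;
        final String language;

        Fs(String type, String language) {
            this.type = type;
            this.language = language;
        }
    }

    /** Supertype links: DocMeta is a subtype of the built-in type. */
    static final Map<String, String> SUPERTYPE = new HashMap<>();
    static {
        SUPERTYPE.put(DOC_META, DOC_ANNOTATION);
    }

    /** Keeps only feature structures whose type the target defines. */
    static List<Fs> deserialize(List<Fs> serialized, Set<String> targetTypes) {
        List<Fs> kept = new ArrayList<>();
        for (Fs fs : serialized) {
            if (targetTypes.contains(fs.type)) {
                kept.add(fs); // type known to the target: deserialized
            }                 // otherwise: skipped
        }
        return kept;
    }

    /** True if the type is the document annotation type or a subtype. */
    static boolean isDocumentAnnotation(String type) {
        for (String t = type; t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(DOC_ANNOTATION)) {
                return true;
            }
        }
        return false;
    }

    /** Language of the first document annotation found, else the default. */
    static String documentLanguage(List<Fs> indexed) {
        for (Fs fs : indexed) {
            if (isDocumentAnnotation(fs.type)) {
                return fs.language;
            }
        }
        return DEFAULT_LANGUAGE;
    }

    /** Runs steps 1) to 4) for a target with or without the DocMeta type. */
    public static String languageAfterRoundTrip(boolean targetDefinesDocMeta) {
        // Steps 1), 1a), 2): one indexed DocMeta instance with language "latin".
        List<Fs> serialized = Arrays.asList(new Fs(DOC_META, "latin"));

        Set<String> targetTypes = new HashSet<>();
        targetTypes.add(DOC_ANNOTATION); // built-in type, always present
        if (targetDefinesDocMeta) {
            targetTypes.add(DOC_META);
        }
        // Steps 3), 4): serialize everything, filter on the way in.
        return documentLanguage(deserialize(serialized, targetTypes));
    }

    public static void main(String[] args) {
        System.out.println("target without DocMeta: " + languageAfterRoundTrip(false));
        System.out.println("target with DocMeta:    " + languageAfterRoundTrip(true));
    }
}
```

In this sketch the target without DocMeta ends up with no document annotation 
carrying the language, which matches the "x-unspecified" result in the report; 
whether the real deserializer behaves the same way once DocMeta is added to 
the target type system is exactly what the question above is meant to find out.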
                
> Binary CAS format 6 + type filtering fails to deserialize document annotation 
> correctly 
> ----------------------------------------------------------------------------------------
>
>                 Key: UIMA-3141
>                 URL: https://issues.apache.org/jira/browse/UIMA-3141
>             Project: UIMA
>          Issue Type: Bug
>          Components: Core Java Framework
>    Affects Versions: 2.4.1SDK
>            Reporter: Richard Eckart de Castilho
>            Assignee: Marshall Schor
>
> When a custom document annotation type is used, the language is not properly 
> restored after deserializing from CAS format 6.
> Expected: deserialized CAS has language "latin"
> Actual: deserialized CAS has language "x-unspecified"
> If the line {{sourceCas.addFsToIndexes(ma);}} is commented out, the code 
> works.
> {code}
> import static org.junit.Assert.assertEquals;
> import static org.junit.Assert.assertTrue;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.FileOutputStream;
> import java.io.InputStream;
> import java.io.OutputStream;
> import org.apache.commons.io.IOUtils;
> import org.apache.uima.cas.CAS;
> import org.apache.uima.cas.impl.Serialization;
> import org.apache.uima.cas.text.AnnotationFS;
> import org.apache.uima.resource.metadata.TypeSystemDescription;
> import org.apache.uima.resource.metadata.impl.TypeSystemDescription_impl;
> import org.apache.uima.util.CasCreationUtils;
> import org.junit.Rule;
> import org.junit.Test;
> import org.junit.rules.TemporaryFolder;
> public class MinimalTest
> {
>     @Rule
>     public TemporaryFolder testFolder = new TemporaryFolder();
>     @Test
>     public void test()
>         throws Exception
>     {
>         TypeSystemDescription sourceTsd = new TypeSystemDescription_impl();
>         sourceTsd.addType("DocMeta", "", CAS.TYPE_NAME_DOCUMENT_ANNOTATION);
>         TypeSystemDescription targetTsd = new TypeSystemDescription_impl();
>         CAS sourceCas = CasCreationUtils.createCas(sourceTsd, null, null);
>         AnnotationFS ma = sourceCas.createAnnotation(
>                 sourceCas.getTypeSystem().getType("DocMeta"), 0, 0);
>         sourceCas.addFsToIndexes(ma);
>         sourceCas.setDocumentLanguage("latin");
>         sourceCas.setDocumentText("test");
>         File file = testFolder.newFile("test.bin");
>         OutputStream os = new FileOutputStream(file);
>         Serialization.serializeWithCompression(sourceCas, os,
>                 sourceCas.getTypeSystem());
>         IOUtils.closeQuietly(os);
>         assertTrue(new File(testFolder.getRoot(), "test.bin").exists());
>         CAS targetCas = CasCreationUtils.createCas(targetTsd, null, null);
>         InputStream is = new FileInputStream(file);
>         Serialization.deserializeCAS(targetCas, is,
>                 sourceCas.getTypeSystem(), null);
>         IOUtils.closeQuietly(is);
>         assertEquals("latin", targetCas.getDocumentLanguage());
>         assertEquals("test", targetCas.getDocumentText());
>     }
> }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
