Hi,

I am currently looking for a good approach to storing a lot of CAS data. What I 
want to do is annotate a lot of text with basic annotations and save the 
result. Then I can read the CAS objects with these basic annotations and don't 
have to compute them over and over, because they basically never change. 
However, "basic" does not necessarily mean that the computation is fast; that's 
why I want the storage.

Now I considered binary storage because it's fast and the resulting files are 
not very big compared to XMI serialization. But I have the requirement that I 
want to be able to extend the type system (add features and types) without 
rendering the stored CAS objects useless.

I experimented with CASCompleteSerializer, which of course does not offer this 
flexibility (but I still wanted to see how it works). Now I was hoping that by 
using CASSerializer I would perhaps get the flexibility I want.

I serialize with

ByteArrayOutputStream baos = new ByteArrayOutputStream();
Serialization.serializeCAS(aJCas.getCas(), baos);

and I deserialize with

byte[] casData = ...
Serialization.deserializeCAS(aCAS, new ByteArrayInputStream(casData));

What DID work: when I add a feature to a type that was serialized, I can use 
the feature after deserialization (that was not possible with 
CASCompleteSerializer). But when I add a new type which was not part of the 
serialization, something odd happens. The AnalysisEngines seem to work fine: I 
can read annotations which had been serialized before, and I can add new ones 
and read them again, too.
However, when I want to store the final result as XMI (I did this for use with 
the annotationViewer), the XMI serialization fails with an error. The XMI 
serialization is done by

FileOutputStream out = new FileOutputStream(outFile);
XmiCasSerializer.serialize(aCas, out);
out.close();

which has always worked fine. The error is

Caused by: java.lang.IndexOutOfBoundsException: Index: 59, Size: 52
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at org.apache.uima.cas.impl.StringHeap.getStringForCode(StringHeap.java:150)
        at org.apache.uima.cas.impl.CASImpl.getStringForCode(CASImpl.java:2139)
        at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFeatures(XmiCasSerializer.java:892)
        at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:753)
        at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
        at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
        at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
        at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1567)
        at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1638)
        at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1585)
        at de.julielab.jules.consumer.CasToXmiConsumer.writeXmi(CasToXmiConsumer.java:338)
        at de.julielab.jules.consumer.CasToXmiConsumer.processCas(CasToXmiConsumer.java:288)
        at org.apache.uima.analysis_engine.impl.compatibility.CasConsumerAdapter.process(CasConsumerAdapter.java:99)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:375)
        ... 4 more
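
As far as I can read the trace, the XMI serializer apparently asked the string 
heap for the string with code 59 while the heap held only 52 entries. To make 
that concrete, here is a stand-alone toy model of such a lookup. It assumes the 
heap is essentially a list indexed by code; ToyStringHeap and its methods are 
made up for illustration and are not UIMA's StringHeap, and the sketch says 
nothing about why the code and the heap got out of sync.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a string heap; NOT UIMA's StringHeap. */
public class ToyStringHeap {
    // Assumption: strings are kept in a list and referenced by their index ("code").
    private final List<String> strings = new ArrayList<>();

    /** Stores a string and returns the code (list index) that refers to it. */
    public int add(String s) {
        strings.add(s);
        return strings.size() - 1;
    }

    /** Resolves a code to its string; an out-of-range code makes the list lookup throw. */
    public String getStringForCode(int code) {
        return strings.get(code);
    }

    /** Number of strings currently stored. */
    public int size() {
        return strings.size();
    }

    public static void main(String[] args) {
        ToyStringHeap heap = new ToyStringHeap();
        for (int i = 0; i < 52; i++) {
            heap.add("s" + i); // heap now holds codes 0..51
        }
        try {
            // A stale or misread code, as in the trace: 59 with only 52 entries.
            heap.getStringForCode(59);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("lookup failed: " + e.getMessage());
        }
    }
}
```

In this toy model the lookup only fails when something actually tries to 
resolve the bad code, which would at least be consistent with the 
AnalysisEngines running fine and only the XMI serialization stumbling.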

Is this behaviour expected or did I just miss something? I don't really need 
the XMI serialization in my use case but I'm not too confident in the whole 
storage procedure when such an error happens.

Thanks for any hints,

Erik
