Hi,
I am currently looking for a good approach to storing a lot of CAS data. What I
want to do is annotate a lot of text with basic annotations and save the result.
Then I can read the CAS objects with these basic annotations and don't have to
recompute them over and over, because they basically never change. However,
"basic" does not necessarily mean that the computation is fast - that's why I
want the storage.
Now, I considered binary storage because it's fast and the resulting files are
not very big compared to XMI serialization. But I have the requirement that I
want to be able to extend the type system (add features and types) without
rendering the stored CAS objects useless.
I experimented with CASCompleteSerializer, which of course does not offer this
flexibility (but I still wanted to see how it works). Now I was hoping that by
using CASSerializer I would perhaps get the flexibility I want.
I serialize with
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Serialization.serializeCAS(aJCas.getCas(), baos);
and I deserialize with
byte[] casData = ...
Serialization.deserializeCAS(aCAS, new ByteArrayInputStream(casData));
What DID work: when I add a feature to a type that was serialized, I can use
the feature after deserialization (that was not possible with
CASCompleteSerializer). But when I add a new type which was not part of the
type system at serialization time, something odd happens. The AnalysisEngines
seem to work fine: I can read annotations which had been serialized before, and
I can add new ones and read them again, too.
However, when I want to store the final result as XMI (I do this for use with
the annotationViewer), I get an error during the XMI serialization. The XMI
serialization is done by
FileOutputStream out = new FileOutputStream(outFile);
XmiCasSerializer.serialize(aCas, out);
out.close();
which has always worked fine before. The error is
Caused by: java.lang.IndexOutOfBoundsException: Index: 59, Size: 52
    at java.util.ArrayList.RangeCheck(ArrayList.java:547)
    at java.util.ArrayList.get(ArrayList.java:322)
    at org.apache.uima.cas.impl.StringHeap.getStringForCode(StringHeap.java:150)
    at org.apache.uima.cas.impl.CASImpl.getStringForCode(CASImpl.java:2139)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFeatures(XmiCasSerializer.java:892)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeFS(XmiCasSerializer.java:753)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.encodeIndexed(XmiCasSerializer.java:700)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.serialize(XmiCasSerializer.java:268)
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiCasDocSerializer.access$700(XmiCasSerializer.java:108)
    at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1567)
    at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1638)
    at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:1585)
    at de.julielab.jules.consumer.CasToXmiConsumer.writeXmi(CasToXmiConsumer.java:338)
    at de.julielab.jules.consumer.CasToXmiConsumer.processCas(CasToXmiConsumer.java:288)
    at org.apache.uima.analysis_engine.impl.compatibility.CasConsumerAdapter.process(CasConsumerAdapter.java:99)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:375)
    ... 4 more
Is this behaviour expected, or did I just miss something? I don't really need
the XMI serialization in my use case, but I can't be very confident in the whole
storage procedure when such an error happens.
Thanks for any hints,
Erik