Re: How to use the new binary CAS (de)serialization?

Marshall Schor Mon, 08 Jul 2013 20:32:39 -0700

On 7/8/2013 6:06 PM, Richard Eckart de Castilho wrote:
> On 08.07.2013 at 23:49, Marshall Schor <[email protected]> wrote:
>
>>> The documentation says:
>>>
>>>> Deserialize with type filtering:
>>>>
>>>> The reuseInfo should be null unless deserializing a delta CAS, in which 
>>>> case, it must be the reuse info captured when the original CAS was 
>>>> serialized out. If the target type system is identical to the one in the 
>>>> CAS, you may pass null for it. If a delta cas is not being received, you 
>>>> must pass null for the reuseInfo.
>>>>
>>>> Serialization.deserializeCAS(cas, bais, tgtTypeSystem, reuseInfo);
>>> So I assume that when I deserialize my persisted CAS into a fresh one which 
>>> doesn't contain any types, the only thing that should arrive is the SofA. 
>>> But, no matter what serialization format I use (0, 4, or 6), I always get 
>>> an ArrayIndexOutOfBoundsException.
>>>
>>> I create the target CAS like this:
>>>
>>>        CAS cas = CasCreationUtils.createCas((TypeSystemDescription) null, null, null);
>>>
>>> Format 6:
>>>
>>> java.lang.ArrayIndexOutOfBoundsException: 37
>>>     at org.apache.uima.cas.impl.TypeSystemImpl.getTypeInfo(TypeSystemImpl.java:1566)
>>>     at org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1701)
>>>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1203)
>>>     at org.apache.uima.cas.impl.CASImpl.reinit(CASImpl.java:1168)
>>>     at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:171)
>>>        …
>>>
>>> Am I misunderstanding how the (de)serialization is supposed to work?
>> Form 6 supports having different type systems.  When using this, it expects
>> the "other" type system to be passed in, as a type system impl object.  If
>> "null" is passed in, then it assumes the "other" type system is identical to
>> the first one.  This is what the JavaDocs mean when they say:
>>
>> If the target type system is identical to the one in the CAS, you may pass
>> null for it.
> In the sentence above, I assumed that "CAS" means "that which I deserialize
> into/the target CAS" and that "target type system is identical" means that
> "I want all types available in the target CAS to be deserialized/I do not
> want any types that are available in the target CAS to be ignored".
Yes, I had a hard time figuring out how to "name" the various parts across all
the use cases...

In that description, "CAS" did mean "that which I deserialize into".  The
"target type system is identical" means that the type system of the CAS and the
type system of the serialized data are the same.  Basically, the serialization
/ deserialization mechanism needs to know both type systems in order to figure
out how to decode things.
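
To illustrate why both type systems are needed, here is a toy sketch in plain
Java.  It is not UIMA code: the class and method names are hypothetical, and it
simply assumes that serialized data refers to types by numeric codes that index
into the writer's type table.  Under that assumption, the reader must map source
codes to its own codes by type name, and a direct lookup in a smaller target
table can run off the end of the array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these names are hypothetical and are not UIMA classes.
class TypeCodeMappingSketch {

    /** For each source type code, finds the target's code for the same type name (-1 if absent). */
    static int[] buildMapping(List<String> sourceTypes, List<String> targetTypes) {
        int[] map = new int[sourceTypes.size()];
        for (int code = 0; code < map.length; code++) {
            map[code] = targetTypes.indexOf(sourceTypes.get(code));
        }
        return map;
    }

    /** Decodes source type codes into target type names, skipping types the target lacks. */
    static List<String> decode(int[] serializedCodes, List<String> sourceTypes,
                               List<String> targetTypes) {
        int[] map = buildMapping(sourceTypes, targetTypes);
        List<String> decoded = new ArrayList<>();
        for (int code : serializedCodes) {
            int targetCode = map[code];
            if (targetCode >= 0) {
                decoded.add(targetTypes.get(targetCode));
            }
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("Sofa", "Token", "Sentence");
        List<String> target = Arrays.asList("Sofa");
        int[] data = {0, 1, 2, 1};

        // With both type systems known, unknown types are filtered out cleanly.
        System.out.println(decode(data, source, target));

        // Without the source type system, codes are looked up in the wrong table.
        String[] targetTable = target.toArray(new String[0]);
        try {
            System.out.println(targetTable[data[2]]);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive lookup failed: " + e.getMessage());
        }
    }
}
```

In this model, passing "null" for the other type system corresponds to skipping
the mapping step, which only works when the two tables are identical.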
>
>> So, to make form 6 work for you, you have to do something like:
>>
>>  a) Create an instance of a type system impl for the types in your
>> serialized form.  For instance, if you created a CAS with some types in it
>> and serialized it, then before you get rid of that CAS, save its type system
>> in a variable:
>>
>>    TypeSystem tsThatWasSerialized = theCASthatWasSerialized.getTypeSystem();
>>
>> Use this type system as the argument (not "null") when calling the form 6
>> style deserialize:
>>
>> Serialization.deserializeCAS(cas, bais, tsThatWasSerialized, null);
>>
>> Is that something like what you did? 
> Nope, that's not what I did. I thought it was not necessary to preserve the
> "source" type system.
Well, it is for this.
> I interpreted the documentation such that "tsThatWasSerialized" was not the
> "source" type system, but the "target" type system (e.g. a subset of the
> actual target CAS type system).
I apologize for the confusion.  I'm happy to improve the documentation
(suggestions welcome).  I did struggle to find some wording that would work for
the various use-cases.
>
> Ignoring the potential waste of space, wouldn't you find it useful to
> serialize all used types of the type system as part of the format 6, thus
> avoiding having to maintain an external copy of the type system?
Sure. But it comes at a cost (space and time).  I think that there are many use
cases of this serialization (e.g., for sending CASes in UIMA-AS between nodes)
where sending the type system along is not needed.  This was the original
motivation for doing this type-mapping kind of thing.  It allows the following
scenario and efficiencies:

1) Imagine a UIMA pipeline acting as a client for a UIMA service running
remotely.  That UIMA service has some type system, let's call it TS_service,
which it defines for the work that it does.

   Note that the same service might be used by many different UIMA Clients, each
having some different type system.

2) When the UIMA client pipeline starts up, UIMA forms a combined type system -
combining the types from the Service with those of the client.  In doing this,
the client acquires knowledge of each service's type system.  The type system
merge UIMA does at startup time would typically result in a (much) bigger type
system than TS_service.

3) When it's time to send a CAS to the service, when using form 6, the
serialization takes advantage of the knowledge of the service's TS_service, and
only sends the parts of the CAS having those types to the service.  This is in
contrast to other methods of client-service communication, which send the entire
CAS to the service.
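
The effect of step 3 can be pictured with a toy sketch in plain Java.  It is not
UIMA code and does not show how form 6 is implemented: the Fs class, the
selectForService method and the type names are all hypothetical, and a type
system is reduced to a set of type names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: these classes are hypothetical and are not part of UIMA.
class ServiceFilterSketch {

    /** A minimal stand-in for a feature structure: a type name plus some payload. */
    static final class Fs {
        final String type;
        final String payload;

        Fs(String type, String payload) {
            this.type = type;
            this.payload = payload;
        }
    }

    /** Keeps only the feature structures whose type is defined in the service's type system. */
    static List<Fs> selectForService(List<Fs> clientCas, Set<String> tsService) {
        List<Fs> toSend = new ArrayList<>();
        for (Fs fs : clientCas) {
            if (tsService.contains(fs.type)) {
                toSend.add(fs);
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        // TS_service is a subset of the client's merged type system.
        Set<String> tsService = new HashSet<>(Arrays.asList("Token", "Sentence"));
        List<Fs> clientCas = Arrays.asList(
                new Fs("Token", "Hello"),
                new Fs("ClientOnlyAnnotation", "x"),
                new Fs("Sentence", "Hello world."),
                new Fs("AnotherClientType", "y"));

        List<Fs> toSend = selectForService(clientCas, tsService);
        System.out.println(toSend.size() + " of " + clientCas.size()
                + " feature structures would be sent");
    }
}
```

The larger the merged client type system is relative to TS_service, the more
data this kind of filtering leaves out of what goes over the wire.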
> The CASCompleteSerializer conveniently wraps up all data (CAS + type system)
> in a single serializable object. I find that very convenient.
> The only annoying part is that it's not possible to deserialize that into a
> CAS with a new type system, e.g. with some types added or removed.
I think it would be pretty easy to do a similar thing with the compressed binary
forms.
> Btw. it might be nice if deserializeCAS() could not only detect the formats
> 0, 4 and 6, but also serialized forms of the CASCompleteSerializer.
Another enhancement :-)
>
> Did you do any performance measures for the new serialization forms?
I think it depends pretty heavily on the kind of machine you run on.  Running
on my Intel i7 laptop, which has a multi-level (L1, L2, etc.) cache
architecture, the compressed serialization actually ran faster than plain binary
serialization on large test CASes.  I suspect this was because it compressed
these so much.  (And I was measuring using byte-array style input/output, not
writing to disk.)  Needless to say, I was pretty surprised by this.

-Marshall
