While converting the plain binary serialization code for UIMA v3, I ran across what looks like a problem.

The plain (not compressed) binary serialization form with delta CAS support is sometimes used for communication between distributed UIMA-AS services and clients (it does require that all the type systems be identical, though). The client sends a full CAS to the service, which then keeps track of changes made subsequently. When the time comes to return the CAS to the client, "delta" CAS serialization sends back just the new things created in the CAS plus any changes to existing things.

The binary serialization code appears to have a bug or limitation when sending changes made to existing entries in short or long arrays; this limitation doesn't exist for boolean/byte arrays.

In pseudocode, what's done for these changes is to send:

  (1) an int: the number of changes following
  (2) for each change: an int representing the address into the aux heap of the item
  (3) for each change: a byte/short/int representing the value

The bug is in item (2): for the short and long arrays, the address is sent as a "short" instead of as an int; for the byte arrays (also used for boolean arrays), it is sent as an "int", which I think is correct.

This means that serialization will give wrong results if there's a change to some item in the short or long aux heaps which is indexed beyond 32767 items.

I think this should be fixed; but it will "break" compatibility with any existing stored serialized form, and furthermore, for UIMA-AS transport use, both the client and the server will need coordinated updates.

As a minimum, the code should probably check whether the error would occur (trying to serialize some change at slot > 32767) and throw an exception.

If we change this to write an "int" (instead of a short), we could also add a global configuration flag to disable the change, if needed for some backward compatibility purpose.

I would welcome opinions on how best to approach this...

-Marshall
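
P.S. To make the truncation concrete, here is a minimal standalone sketch. It is not the actual UIMA serialization code: the class and method names (DeltaIndexDemo, roundTripAsShort, roundTripAsInt, checkFitsInShort) are made up for illustration, and it only assumes that the index is written with something equivalent to DataOutputStream.writeShort versus writeInt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; none of these names come from the UIMA code base.
class DeltaIndexDemo {

  // Writes the aux-heap index as a 16-bit short and reads it back,
  // mimicking what happens to the index of a modified short/long array slot.
  static int roundTripAsShort(int auxHeapIndex) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        out.writeShort(auxHeapIndex); // only the low 16 bits are written
      }
      try (DataInputStream in =
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return in.readShort(); // read back as a signed 16-bit value
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Same round trip, but with the index written as a 32-bit int.
  static int roundTripAsInt(int auxHeapIndex) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (DataOutputStream out = new DataOutputStream(bytes)) {
        out.writeInt(auxHeapIndex);
      }
      try (DataInputStream in =
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return in.readInt();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // The "minimum" option: refuse to serialize an index that a short cannot hold.
  static void checkFitsInShort(int auxHeapIndex) {
    if (auxHeapIndex < 0 || auxHeapIndex > Short.MAX_VALUE) {
      throw new IllegalStateException(
          "aux heap index " + auxHeapIndex + " cannot be serialized as a short");
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTripAsShort(32767)); // 32767: still fits
    System.out.println(roundTripAsShort(40000)); // -25536: silently corrupted
    System.out.println(roundTripAsInt(40000));   // 40000: preserved
  }
}
```

The guard is the non-breaking choice, since it leaves the wire format alone and only turns silent corruption into an error; switching to the int form is what would require the coordinated client/server update.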
