Hi,

I changed CasIOUtils to use the Header, and I extended the header with a
bit (0x08) indicating an included type system. The header does not yet
carry any information about how the type system is serialized. The
Java-serialized formats now also have a binary header: I did not want to
make the header itself serializable, since it should be read and written
by the same code.
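
To illustrate the flag idea only, here is a minimal sketch. It assumes a
hypothetical header made of a single 4-byte flags word; this is not the
actual CommonSerDes layout, and the class and method names are made up
for the example. Only the bit value 0x08 comes from the change described
above.

```java
import java.nio.ByteBuffer;

public class HeaderFlagSketch {
    // Bit value from the message above; everything else here is hypothetical.
    static final int TYPE_SYSTEM_INCLUDED = 0x08;

    // Builds a 4-byte header: the given base flags, with the type system bit set or cleared.
    static byte[] writeHeader(int baseFlags, boolean typeSystemIncluded) {
        int flags = typeSystemIncluded
                ? (baseFlags | TYPE_SYSTEM_INCLUDED)
                : (baseFlags & ~TYPE_SYSTEM_INCLUDED);
        return ByteBuffer.allocate(4).putInt(flags).array();
    }

    // Reads the flags word back and tests the type system bit.
    static boolean hasTypeSystem(byte[] header) {
        int flags = ByteBuffer.wrap(header).getInt();
        return (flags & TYPE_SYSTEM_INCLUDED) != 0;
    }

    public static void main(String[] args) {
        System.out.println(hasTypeSystem(writeHeader(0x01, true)));  // true
        System.out.println(hasTypeSystem(writeHeader(0x01, false))); // false
    }
}
```

A reader that does not know the bit simply ignores it, which is why a
spare flag bit looked like the least intrusive way to extend the header.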

I had assumed that old UIMA versions (e.g., 2.8.1) would be able to
load the new CAS files, but my tests failed. I have no idea yet why.
Overall, I am not very happy with the current solution, but I could live
with it.

Maybe someone wants to take a look at it?


Best,

Peter

On 20.07.2016 at 14:30, Peter Klügl wrote:
> Hi,
>
>
> I'll try to find the time to make these changes this week, next week at
> the latest.
>
>
> btw, input stream sniffing in order to distinguish XMI and XCAS is
> currently not supported. There could be a lot of text before the
> relevant element occurs, e.g., license text.
>
>
> Best,
>
>
> Peter
>
>
> On 20.07.2016 at 14:19, Marshall Schor wrote:
>> Hi,
>>
>> We can change the header, but:
>>
>> The changed header ought to be "readable" by previous versions of UIMA.  
>>
>> For XMI and XCAS, these do not currently have special headers, and if we
>> added these, those formats could not be read by older versions of UIMA.
>> Those formats contain sufficient distinguishing initial strings to
>> distinguish them, though.
>>
>> The XMI format is also specified in an OASIS standard, which the UIMA
>> project is said to (mostly) follow:
>> http://uima.apache.org/uima-specification.html
>>
>> For binary serializations, I think there's room in the header for an extra
>> bit, which, if on, could indicate that a type system was included.  I think
>> it would be good to have a header extension, when type systems are included,
>> to specify the format and version of the type system serialization.
>>
>> Most serializations in core UIMA have not included the type system.  The one
>> which does is CASCompleteSerializer.  This is a "serializable" (using
>> standard Java serialization) object containing serializable forms of the CAS
>> and Type System.
>>
>> Regarding making methods in CommonSerDes public:
>>
>> It is fine to make them public in the sense that they are accessible from
>> other packages, not in a sub-type hierarchy.  But I think it is best not to
>> include CommonSerDes in a package which is intended for end users, because
>> the end-user UIMA APIs should be (as much as possible) stable over a long
>> time period.  Details of how we evolve headers, etc., should not disturb end
>> users, if possible; keeping these as public but in packages with names like
>> xxx.impl or xyz.internal.abc etc. is the way this has been traditionally
>> done.  It allows us to evolve these without affecting end-user APIs.
>>
>> Just to be clear: I would not consider uimaFIT and Ruta to be "end-users", as
>> they are developed within the UIMA project, and we are willing to evolve them
>> together with UIMA core changes.
>>
>> We don't have a deadline for the next release, but it's mostly ready to go,
>> and will solve a significant issue for people wanting to upgrade their
>> Eclipse to Neon :-).
>>
>> -Marshall
>>
>> On 7/20/2016 5:03 AM, Peter Klügl wrote:
>>> Ok, after looking at the code I must admit that there is much more to do
>>> than I expected. We first need to discuss several things:
>>>
>>> - can we change the header at all?
>>>
>>> - do we support type system inclusion in the header?
>>>
>>> - do we support type system inclusion in the serialized files?
>>>
>>> - which serial format are which ones?
>>>
>>> - can we make the methods in CommonSerDes public?
>>>
>>>
>>> What is the deadline for the release? I am now quite loaded with work
>>> until next Wednesday :-(
>>>
>>>
>>> Best,
>>>
>>>
>>> Peter
>>>
>>>
>>> On 19.07.2016 at 22:39, Marshall Schor wrote:
>>>> Great.
>>>>
>>>> There's now also common code for writing / reading UIMA serialization 
>>>> headers, in
>>>>
>>>> CommonSerDes (in org.apache.uima.cas.impl )
>>>>
>>>> This includes the extensions to support versioning the serializations,
>>>> which start to be needed in the next release because a bug fix is slightly
>>>> changing the serialized form for **delta binary** CAS.
>>>>
>>>> So, it would be good to use that rather than have another separate header
>>>> reader/writer to maintain.
>>>>
>>>> -Marshall
>>>>
>>>>
>>>> On 7/19/2016 4:13 PM, Peter Klügl wrote:
>>>>> Ah, I didn't know that enum. I'll adapt the code and enum.
>>>>>
>>>>> On 19.07.2016 at 20:09, Marshall Schor wrote:
>>>>>> We already have an enum in the core for various serial formats.  The 
>>>>>> class is
>>>>>>
>>>>>> public enum SerialFormat {
>>>>>>    UNKNOWN,
>>>>>>    XCAS,         // with reachability filtering
>>>>>>    XMI,          // with reachability filtering
>>>>>>    BINARY,       // no filtering
>>>>>>    COMPRESSED,   // no filtering  (form 4)
>>>>>>    COMPRESSED_FILTERED,   // with reachability and type and feature filtering (form 6)
>>>>>>    COMPRESSED_PROJECTION, // with subset of views
>>>>>> }
>>>>>>
>>>>>> (I don't think COMPRESSED_PROJECTION is in use...)
>>>>>>
>>>>>> This has been around for maybe 3 years.  I would be in favor of
>>>>>> considering using and/or extending this as needed, rather than having
>>>>>> two formats (that is, the proposed SerializationFormat class).
>>>>>>
>>>>>> -Marshall
>>>>>>
>>>>>> On 7/19/2016 2:49 AM, Peter Klügl wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> yes, the class should be officially available to external code. I
>>>>>>> already included it in the CAS Editor and in Ruta. I also plan to use it
>>>>>>> in our inhouse code. I'll change the enforcer rule.
>>>>>>>
>>>>>>>
>>>>>>> I can write the docs but any help is welcome since I do not know how
>>>>>>> much spare time I have for the rest of the week for this. I'll take a
>>>>>>> look where the documentation should be added. Haven't looked at it for
>>>>>>> some time ;-)
>>>>>>>
>>>>>>>
>>>>>>> I just chose the name of the class Richard contributed since I thought
>>>>>>> it is really suitable. Then, I also noticed the uimaFIT class. This is
>>>>>>> not a really good situation, but I would not change the name because of
>>>>>>> it.
>>>>>>>
>>>>>>>
>>>>>>> I would not split the API from the implementation. I do not see any
>>>>>>> advantages right now. The class is just a simple utils class with only
>>>>>>> static methods like CasCreationUtils (which is also not separated).
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>>> On 18.07.2016 at 22:26, Marshall Schor wrote:
>>>>>>>> This is OK with me.  I can even volunteer to write the docs (but am
>>>>>>>> happy to let others do it :-) ).
>>>>>>>>
>>>>>>>> I'll wait to hear about the split (if any) between the public API and
>>>>>>>> the impl.
>>>>>>>>
>>>>>>>> And, we'll need to change the next version # to 2.9.0, from 2.8.2, due
>>>>>>>> to this being that kind of a change.
>>>>>>>>
>>>>>>>> Is everyone OK with all of this?
>>>>>>>>
>>>>>>>> -Marshall
>>>>>>>>
>>>>>>>> On 7/18/2016 2:39 PM, Richard Eckart de Castilho wrote:
>>>>>>>>> I believe the intention is that this class becomes part of the public
>>>>>>>>> API.
>>>>>>>>>
>>>>>>>>> Also, my understanding is that it would do a superset of what the
>>>>>>>>> uimaFIT class by the same name does. We could then probably deprecate
>>>>>>>>> the respective uimaFIT class and suggest using the core class instead.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> -- Richard
>>>>>>>>>
>>>>>>>>>> On 18.07.2016, at 20:30, Marshall Schor <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> This is a new class added to the uimaj-core project, in the
>>>>>>>>>> org.apache.uima.util package.  This is fine if this is to be part of
>>>>>>>>>> the official public APIs supported by UIMA going forward; but if that
>>>>>>>>>> is the case, it should probably be documented in the UIMA docs, and
>>>>>>>>>> we'd have to change the version number (due to enforcer rules).
>>>>>>>>>>
>>>>>>>>>> If this is more of an internal-use utility, then it should be in one
>>>>>>>>>> of the internal-use packages, such as
>>>>>>>>>>
>>>>>>>>>>    org.apache.uima.internal.util
>>>>>>>>>>
>>>>>>>>>> This class is similarly named to a UIMAFit class; are these related?
>>>>>>>>>>
>>>>>>>>>> If some of the APIs are to be permanent and public and part of the
>>>>>>>>>> official public APIs, but some are internal implementation details,
>>>>>>>>>> please consider using an interface and an ".impl" (or equivalent)
>>>>>>>>>> approach; packages which support these are:
>>>>>>>>>>
>>>>>>>>>>    org.apache.uima.util  and
>>>>>>>>>>
>>>>>>>>>>    org.apache.uima.util.impl
>>>>>>>>>>
>>>>>>>>>> --------------
>>>>>>>>>>
>>>>>>>>>> If this is only an internal kind of change, not intending to affect
>>>>>>>>>> the official UIMA APIs, then moving to the internal.util package will
>>>>>>>>>> fix the "enforcer" error the build is currently getting.
>>>>>>>>>>
>>>>>>>>>> -Marshall
>>>>>>>>>>
