Re: versioning cas serializations

Richard Eckart de Castilho Wed, 13 Jan 2016 12:57:03 -0800

Hi,

On 13.01.2016, at 21:28, Marshall Schor <[email protected]> wrote:
> I would turn this on for the repaired binary delta format, and supply a
> version number.
> 
> Our current compressed formats use "1" as the incrementing version number.
>


> I'm leaning toward something simple, such as using the Major/Minor/Patch
> format, each value 1 byte, in the 3 lower bytes of the 2nd version word,
> giving 256 possibilities for each (more than I've ever seen used).

+1 for versioning the CAS formats. Every data format should include version 
information :) The BinaryCasWriter in DKPro Core uses 'D', 'K', 'P', 'r', 'o', 
'1' as the header for the 6+ format (serialization with compression form 6  
prepended with type system information).

Is it really necessary to have a complex versioning scheme for data formats? 
I'd rather tend towards plain int versioning: 1, 2, 3, 4, etc. Wouldn't that 
be sufficient?
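
For comparison, here is a minimal Java sketch of the packed Major/Minor/Patch
layout from the quoted proposal (one byte each, in the 3 lower bytes of the
version word). The class and method names are hypothetical and are not part
of UIMA; the placement of major in the highest of the three bytes is also an
assumption, since the proposal does not fix the order.

```java
// Hypothetical helper, not a UIMA class: packs a three-part version
// into the 3 lower bytes of a 32-bit word; the top byte stays 0.
final class FormatVersion {
    private FormatVersion() {}

    /** Packs major/minor/patch (each 0..255) into one int. */
    static int pack(int major, int minor, int patch) {
        // A negative part sets the sign bit of the OR-ed value.
        if ((major | minor | patch) < 0 || major > 255 || minor > 255 || patch > 255) {
            throw new IllegalArgumentException("each part must be in 0..255");
        }
        // Assumed order: major in bits 16-23, minor in bits 8-15, patch in bits 0-7.
        return (major << 16) | (minor << 8) | patch;
    }

    static int major(int word) { return (word >>> 16) & 0xFF; }
    static int minor(int word) { return (word >>> 8) & 0xFF; }
    static int patch(int word) { return word & 0xFF; }
}
```

Under this assumed layout, the value 1 used by the current compressed formats
would decode as 0.0.1, so a plain int counter and the packed form coincide
for small version numbers.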

> The "semantic versioning" standard has sparked some push-back (see
> https://gist.github.com/jashkenas/cbd2b088e20279ae2c8e )
> basically saying the "mechanical" approach of semantic versioning isn't rich
> enough for the grey areas of real world use, and ends up obscuring the purpose
> of indicating how "far" one version is from another. 


Regarding SemVer: personally, I don't fully trust the plugin we are using. 
For example, I tried making some changes to uimaFIT that I believe are 
backwards-compatible, but the semver plugin believes otherwise. 

Other than that, I am not quite convinced by the criticism of semver either. 

Let's just consider (for software):

- if we do bug-fixes, we typically make this an x.y.+1 - bug-fixes shouldn't 
change the API - sounds reasonable to me

- when adding new features, I would personally always tend towards an x.+1.0 - 
in the past, we had various UIMA releases that added cool new features but 
increased the version only at the last digit. Undeserved, I think. Since we use 
semver, we increase the middle digit more often, and I think that is very 
appropriate and reflects the activity in the project much better.

- that leaves the first digit, which IMHO is often a marketing digit: increase 
it to tell people that all is new and shiny and they should have another fresh 
look at the project. I don't think we need that. Using it to indicate major 
breaking changes (which are typically part of a major refactoring with cool new 
features that people should have a look at) seems quite appropriate to me. We 
are now in UIMA 2. UIMA 1 was IBM UIMA. I do believe that if we are introducing 
major changes now like a completely new CAS, that warrants going to UIMA 3.

So looking at that, and apart from some doubts that I have about the accuracy 
of the semver plugin, I believe that the idea of semver in general is quite 
sensible - at least when going with a three-part versioning scheme. I would 
consider the plugin an automatic alert for accidentally introducing 
incompatible changes, and the semver idea a guideline. Where we consider it a 
good idea, I think we should add exceptions and overrides to the plugin for 
particular releases. 

Cheers,

-- Richard

> On 13.01.2016, at 21:28, Marshall Schor <[email protected]> wrote:
> 
> Hi,
> 
> I'm working on UIMA-4743 - fixing some binary cas serialization problems,
> which will unfortunately make the binary serialization for "delta" formats
> not backward compatible (the fix may have extra bytes in it).
> 
> We currently have a partially architected scheme for serialization forms,
> which looks like:
>  - 1 word encoding U + I + M + A and also serving to identify byte order
>  - 1 word for bit-encoding some categorizations:
>     -- a bit for delta / non delta
>     -- a bit for compressed / non compressed
>  - 0 or 1 additional word for incrementing in some fashion a version number
>    for a particular serialization category (named below as "2nd version
>    word")
> 
> This 2nd version word is currently only used with compressed serialization
> formats.
> 
> I'm thinking of assigning another bit in the first word to indicate there's a
> 2nd version word present.
> 
> I would turn this on for the repaired binary delta format, and supply a
> version number.
> 
> Our current compressed formats use "1" as the incrementing version number.
> 
> Thinking ahead, perhaps the serialization formats should have a multi-part 2nd
> version word, along some standards. 
> The "semantic versioning" standard has sparked some push-back (see
> https://gist.github.com/jashkenas/cbd2b088e20279ae2c8e )
> basically saying the "mechanical" approach of semantic versioning isn't rich
> enough for the grey areas of real world use, and ends up obscuring the purpose
> of indicating how "far" one version is from another. 
> 
> I'm leaning toward something simple, such as using the Major/Minor/Patch
> format, each value 1 byte, in the 3 lower bytes of the 2nd version word,
> giving 256 possibilities for each (more than I've ever seen used).
> 
> Other ideas?
> 
> -Marshall
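
The header scheme described in the quoted message could be sketched roughly
as below. This is only an illustration under stated assumptions, not the
actual UIMA serialization code: the class name, the bit positions of the
flags, and the choice to put the "version word present" bit in the
categorization word are all hypothetical. Only the overall shape (a "UIMA"
word that also reveals byte order, a word of flag bits, and an optional
version word) comes from the message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch, not the UIMA implementation.
final class CasHeaderSketch {
    // 'U' 'I' 'M' 'A' as ASCII bytes 0x55 0x49 0x4D 0x41, read big-endian.
    static final int MAGIC = 0x55494D41;

    // Assumed bit positions; the message does not specify them.
    static final int FLAG_DELTA = 1 << 0;
    static final int FLAG_COMPRESSED = 1 << 1;
    static final int FLAG_HAS_VERSION_WORD = 1 << 2;

    /** Writes 2 or 3 words; pass null for version to omit the version word. */
    static byte[] write(boolean delta, boolean compressed, Integer version, ByteOrder order) {
        int flags = (delta ? FLAG_DELTA : 0)
                  | (compressed ? FLAG_COMPRESSED : 0)
                  | (version != null ? FLAG_HAS_VERSION_WORD : 0);
        ByteBuffer buf = ByteBuffer.allocate(version != null ? 12 : 8).order(order);
        buf.putInt(MAGIC).putInt(flags);
        if (version != null) {
            buf.putInt(version);
        }
        return buf.array();
    }

    /** Returns {flags, version}; version is -1 when no version word is present. */
    static int[] read(byte[] header) {
        ByteBuffer buf = ByteBuffer.wrap(header).order(ByteOrder.BIG_ENDIAN);
        int magic = buf.getInt();
        if (magic == Integer.reverseBytes(MAGIC)) {
            // The magic word came out byte-swapped, so the writer was little-endian.
            buf.order(ByteOrder.LITTLE_ENDIAN);
        } else if (magic != MAGIC) {
            throw new IllegalArgumentException("not a UIMA header");
        }
        int flags = buf.getInt();
        int version = (flags & FLAG_HAS_VERSION_WORD) != 0 ? buf.getInt() : -1;
        return new int[] {flags, version};
    }
}
```

A reader built this way can tell an old header (no version word) from a new
one by checking a single flag bit, which is the point of reserving that bit.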
