Hi everyone,

I wanted to start a thread of discussion around the current Decimal logical
type and the weaknesses in its implementations (well, the Java one at
least, as that's what I'm using).

There are a couple of Jiras already around this subject:
- AVRO-2078 Avro does not enforce schema resolution rules for Decimal type
<https://issues.apache.org/jira/browse/AVRO-2078> and associated PR
<https://github.com/apache/avro/pull/247>
- AVRO-1721 Should LogicalTypes introduce schema (in)compatibility and
canonical parsing form changes?
<https://issues.apache.org/jira/browse/AVRO-1721>
- AVRO-2164 Make Decimal a first class type.
<https://issues.apache.org/jira/browse/AVRO-2164>

*Avoiding data corruption with Decimal types.*
Basically, the main issue is that deserialising a record whose read and
write schemas have decimal types with different scales leads to silent data
corruption. For example, given the decimal '12.34' serialised with a write
schema that has scale 2, deserialising it with a read schema that has scale
3 yields '1.234'.
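To make the mechanics concrete, here's a minimal Java sketch (plain
BigDecimal, not Avro code itself) of how the corruption arises: the wire
format carries only the unscaled value, and the reader re-attaches whatever
scale its own schema declares.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class ScaleCorruptionDemo {
    public static void main(String[] args) {
        // Serialise 12.34 under a write schema with scale 2: only the
        // unscaled value (1234) goes on the wire; the scale lives in the schema.
        BigInteger unscaled = new BigDecimal("12.34").unscaledValue();

        // A reader whose schema declares scale 3 re-attaches the wrong scale:
        BigDecimal corrupted = new BigDecimal(unscaled, 3);
        System.out.println(corrupted); // prints 1.234 -- silently wrong
    }
}
```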

Currently, the Java implementation, (and likely others), does not see read
and write schemas with different scale as being incompatible. The decimal
*logical* type is not deemed part of the normalised form of the schema and
is ignored.

Schemas evolve over time. Where a scale of 2 may initially be deemed
enough, it may come to pass that a higher, or lower, scale is needed going
forward.

There will always be cases where systems need to read data written with
different schema versions using a consistent read-schema version.

Therefore, it is important that data written with one schema version must
be readable with a later, compatible, schema version. This, IMHO, should
include the decimal logical type. If it does not, then the type becomes
useless.

One possible solution to this is to convert values on the fly, i.e. always
deserialise a decimal using the write schema's scale and then attempt to
convert it to the read schema's scale. When it comes to decimal, there are
two possibilities for a change of scale:

1. Read schema's scale is greater than write's - such schemas should be
seen as compatible.
2. Write schema's scale is greater than read's - The safest / strictest
option here would be to see them as incompatible as there are decimals that
can be serialised by the write schema that can't be deserialised by the
read schema without rounding. A more lenient approach would be to allow the
user to supply a rounding mode in such cases.
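The two cases above could be sketched roughly as follows. To be clear,
`rescale` is a hypothetical helper of my own, not an existing Avro method:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class ScaleConversion {
    /**
     * Hypothetical helper: rescale a decimal read with the writer's scale
     * to the reader's scale. Case 1 (readScale >= writer's scale) is always
     * lossless; case 2 either fails fast or rounds, depending on whether
     * the caller supplied a RoundingMode.
     */
    static BigDecimal rescale(BigDecimal written, int readScale, RoundingMode mode) {
        if (readScale >= written.scale()) {
            return written.setScale(readScale);  // case 1: widening, always safe
        }
        if (mode == null) {
            // case 2, strict: throws ArithmeticException if digits would be lost
            return written.setScale(readScale, RoundingMode.UNNECESSARY);
        }
        return written.setScale(readScale, mode);  // case 2, lenient
    }

    public static void main(String[] args) {
        System.out.println(rescale(new BigDecimal("12.34"), 3, null));
        System.out.println(rescale(new BigDecimal("12.345"), 2, RoundingMode.HALF_UP));
    }
}
```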

I've not mentioned the decimal's precision up until this point as changes
in precision, from a Java standpoint, don't cause any issues upon
deserialisation. However, this may not be the case for other languages and,
conceptually at least, I wonder if it should be allowed to deserialise a
decimal with a lower precision than was used to serialise it. Precision
doesn't actually seem to have much effect, other than for checking that the
fixed byte array is large enough.
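For what it's worth, that size check boils down to the formula the spec
gives for a fixed of n bytes. `maxPrecision` here is my own illustrative
helper, not library code:

```java
public class FixedPrecision {
    // Per the Avro spec, a two's-complement fixed of n bytes can hold at
    // most floor(log10(2^(8n - 1) - 1)) digits of precision.
    static long maxPrecision(int fixedSize) {
        return (long) Math.floor(Math.log10(Math.pow(2, 8 * fixedSize - 1) - 1));
    }

    public static void main(String[] args) {
        System.out.println(maxPrecision(4)); // a 4-byte fixed holds 9 digits
        System.out.println(maxPrecision(8)); // an 8-byte fixed holds 18 digits
    }
}
```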

*Logical types in normalised form*
A secondary issue is whether the logical type should form part of the
normalised form. My own experiences here suggest it should, but I'm less
sure of this.

Having written a 'Schema Store' and client libraries, I've found that,
before logical types were available, it was important to normalise uploaded
schemas to avoid creating many versions of a schema where nothing material
changed, e.g. white space, doc changes etc. When clients required a schema
to deserialize data, or to compile, the store would again return the
normalised form and this worked well.

With the introduction of logical types this had to change. I had to write
my own normalisation code that included the logical types, as these are
important when determining if a schema has changed and for ensuring the
right schema is returned to clients. This suggests to me that logical
types should be available in the normalised form, or at least that there
should be an option to normalise to a form that includes logical types.
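To illustrate the problem: with the current Java library, two decimal
schemas that differ only in scale collapse to the same Parsing Canonical
Form. (This assumes avro on the classpath; SchemaNormalization.toParsingForm
is the existing API.)

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;

public class NormalisationDemo {
    public static void main(String[] args) {
        Schema scale2 = new Schema.Parser().parse(
            "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":9,\"scale\":2}");
        Schema scale3 = new Schema.Parser().parse(
            "{\"type\":\"bytes\",\"logicalType\":\"decimal\",\"precision\":9,\"scale\":3}");

        // Parsing Canonical Form strips logicalType/precision/scale, so two
        // materially different schemas normalise to the identical form "bytes".
        System.out.println(SchemaNormalization.toParsingForm(scale2));
        System.out.println(SchemaNormalization.toParsingForm(scale3));
    }
}
```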

I'd love to hear what others think....

Thanks,

Andy
