A case for adding revision field to Avro schema

Thiruvalluvan M. G. Tue, 21 Sep 2010 05:18:46 -0700

Hi all,

Here is a use case.


An application stores its objects after serializing them with Avro. For each
object type the application uses a schema; that is, all objects of a given
type share the same schema. In order to deserialize an object, the
application must know the schema for that object. It optimizes storage by
storing with each object not the actual schema but a "pointer" to the schema
corresponding to the object's type. The pointer itself should be stored
outside the serialized binary.
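
To make the storage layout concrete, here is a minimal sketch in Python. It
is only an illustration: the SCHEMAS registry, the store/load helpers and the
row layout are hypothetical names, and JSON stands in for the Avro binary
encoding so that the example needs no Avro library.

```python
import json

# Hypothetical in-memory schema registry: pointer -> schema.
SCHEMAS = {
    1: {"type": "record", "name": "User",
        "fields": [{"name": "name", "type": "string"}]},
}

def store(pointer, obj):
    """Build a stored row with the schema pointer kept beside the payload."""
    # JSON is only a stand-in for the Avro binary encoding here.
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return {"schema_pointer": pointer, "payload": payload}

def load(row):
    """Find the writer's schema through the pointer, then decode the payload."""
    writer_schema = SCHEMAS[row["schema_pointer"]]
    return writer_schema, json.loads(row["payload"].decode("utf-8"))

row = store(1, {"name": "thiru"})
schema, obj = load(row)
```

The payload carries no schema information at all; everything needed to
interpret it comes from the pointer stored next to it.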

As the application evolves, the schema for some object types changes. The
application doesn't need to do much if the old and new schemas for an
object type "match" as specified in the Avro specification. While loading,
it uses the new schema for the object type to deserialize the object. If the
object was originally serialized using the old schema, Avro resolves the
schemas and the application transparently works as if the object was indeed
serialized using the new schema. While storing the object, it stores the
"pointer" to the new schema.
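
As a rough illustration of what resolution does for records, here is a
simplified sketch in Python. It covers only the record rules I am relying on
(fields are matched by name, a field missing from the writer's schema takes
the reader's default, and a writer field unknown to the reader is dropped).
The function name resolve_record and the OLD/NEW schemas are made up for the
example, and it works on already-decoded dicts rather than on Avro binary.

```python
def resolve_record(writer_schema, reader_schema, datum):
    """Project a record written with writer_schema onto reader_schema."""
    if writer_schema["name"] != reader_schema["name"]:
        raise ValueError("record names differ, so the schemas do not match")
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            result[name] = datum[name]        # present in both schemas
        elif "default" in field:
            result[name] = field["default"]   # reader-only field with a default
        else:
            raise ValueError("no value or default for field %r" % name)
    return result                             # writer-only fields are dropped

# Hypothetical old and new revisions of the same object type.
OLD = {"type": "record", "name": "User",
       "fields": [{"name": "name", "type": "string"}]}
NEW = {"type": "record", "name": "User",
       "fields": [{"name": "name", "type": "string"},
                  {"name": "email", "type": "string", "default": ""}]}

upgraded = resolve_record(OLD, NEW, {"name": "thiru"})
downgraded = resolve_record(NEW, OLD, {"name": "thiru", "email": "t@example.org"})
```

In this example the schemas match in both directions: reading old data with
NEW fills in the default, and reading new data with OLD drops the extra field.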

One good thing about this design is that there is no need to do schema
migration before a version change. The objects undergo schema change as they
get read and written. If, for some reason, the installation needs to go back
to the old version, the objects modified by the new version in the interim
will continue to be available provided the new and old schemas match in the
opposite direction as well.

Here is a design that would improve things a bit more. Instead of
serializing the object against its actual schema, let's say the application
serializes against a union schema in which the object type's schema is a
branch. As the application evolves, it simply adds a branch to the
union. While reading the object, the application expects one branch but
the serialized object might be using another branch. As long as the branches
"match", Avro would resolve correctly. The current Java generic writer can
correctly pick the branch as long as the object's schema is one of the
branches. The nice thing about this improved design is that there is no
need to store a separate schema "pointer" along with the object. The
"union-index" essentially acts as the pointer and it is internal to Avro.
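
To show why the union index can play the role of the pointer, here is a small
Python sketch of how a union value is laid out in the Avro binary encoding:
the zero-based branch index is written as a long (zigzag, variable-length),
followed by the value encoded against that branch. The helper names are mine,
and the record in the example is assumed to have a single string field.

```python
def encode_long(n):
    """Avro binary encoding of a long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)              # zigzag mapping for 64-bit values
    out = bytearray()
    while z & ~0x7F:                      # more than 7 bits remain
        out.append((z & 0x7F) | 0x80)     # low 7 bits with continuation bit
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Avro string: byte length as a long, then the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_union(branch_index, encoded_value):
    """Avro union: branch index as a long, then the value for that branch."""
    return encode_long(branch_index) + encoded_value

# A record with one string field is just that field's encoding.
# Here it is written through branch 1 of the union.
encoded = encode_union(1, encode_string("ab"))
```

The leading bytes of the result identify the branch, so the reader learns
which schema wrote the object from the serialized data itself.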

But there is one problem. As per the Avro specification, in order to "match",
two schemas of the same type should have the same name. But two schemas with
the same type and name cannot be branches within a union. Thus the design
above will not work. If we modify the spec as follows, it would work:

1. Add a new optional string attribute called "revision" to all named
schemas (record, enum, fixed).
2. Allow a union to contain branches whose schemas have the same type and
name, provided they all have revisions and the revisions are different. (Not
having a revision attribute may be treated as having a null revision, but I'd
rather be less permissive here.)
3. Schemas match as per the current matching rules, even if the revisions do
not match.
4. While writing, the implementation should choose the branch that matches
the type, name and revision.
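
Rules 2 and 4 can be sketched as follows in Python. This is only an
illustration of the proposal: "revision" is not an Avro attribute today, the
function names are hypothetical, and schemas are handled as plain dicts
rather than through any Avro library. It takes the less permissive reading of
rule 2, where a missing revision is not allowed among same-named branches.

```python
NAMED_TYPES = {"record", "enum", "fixed"}

def is_named(schema):
    return isinstance(schema, dict) and schema.get("type") in NAMED_TYPES

def validate_union(branches):
    """Rule 2: same type and name may repeat only with distinct revisions."""
    groups = {}
    for schema in branches:
        if is_named(schema):
            key = (schema["type"], schema["name"])
            groups.setdefault(key, []).append(schema.get("revision"))
    for key, revisions in groups.items():
        if len(revisions) > 1:
            if None in revisions or len(set(revisions)) != len(revisions):
                raise ValueError("ambiguous branches for %s %s" % key)

def pick_branch(branches, datum_schema):
    """Rule 4: index of the branch matching type, name and revision."""
    wanted = (datum_schema["type"], datum_schema["name"],
              datum_schema.get("revision"))
    for index, schema in enumerate(branches):
        if is_named(schema) and wanted == (
                schema["type"], schema["name"], schema.get("revision")):
            return index
    raise ValueError("no branch matches %r" % (wanted,))

# Two hypothetical revisions of the same record in one union.
USER_R1 = {"type": "record", "name": "User", "revision": "1",
           "fields": [{"name": "name", "type": "string"}]}
USER_R2 = {"type": "record", "name": "User", "revision": "2",
           "fields": [{"name": "name", "type": "string"},
                      {"name": "email", "type": "string", "default": ""}]}

union = [USER_R1, USER_R2]
validate_union(union)
index = pick_branch(union, USER_R2)
```

Under rule 3, the reader would still resolve whichever branch was written
against the branch it expects, since matching ignores the revision.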

Caveats:
1. Though we can avoid storing the "pointer" to schema with each object, the
application should somehow figure out the type of the object so as to
associate it with the right union schema. We do not allow union of unions.
The application can "flatten" all the unions for all the object types. It's
not pretty. The application need not resort to this, if it can somehow
associate the object type with the object, (e.g. from the location of the
serialized binary).
2. I don't know the implication of this change in spec for implementations
in languages other than Java.
3. Implementations that do not support revision should ignore it and
continue to work the way they work today. But I'm not sure what the current
ones do when they encounter an attribute they don't understand.

What do you think?

Regards

Thiru
