Hi all,

Here is a use case.

An application stores its objects after serializing them with Avro. It uses one schema per object type; that is, all objects of a given type share the same schema. To deserialize an object, the application must know the schema that object was written with. It optimizes storage by storing not the schema itself with each object but a "pointer" to the schema for the object's type. The pointer itself is stored outside the serialized binary.

As the application evolves, the schemas for some object types change. The application doesn't need to do much as long as the old and new schemas for an object type "match" as specified in the Avro specification. While loading, it deserializes the object using the new schema for the object type as the reader's schema; the stored pointer identifies the schema the object was written with. If the object was originally serialized using the old schema, Avro resolves the two schemas and the application works transparently, as if the object had been serialized using the new schema. While storing the object, it stores the "pointer" to the new schema.

One good thing about this design is that no schema migration is needed before a version change. Objects move to the new schema as they get read and written. If for some reason the installation needs to go back to the old version, the objects modified by the new version in the interim remain readable, provided the new and old schemas also match in the opposite direction.

Here is a design that would improve things a bit more. Instead of serializing the object against its actual schema, let's say the application serializes against a union schema in which the object type's schema is one branch. As the application evolves, it simply adds a branch to the union. While reading the object, the application expects one branch, but the serialized object might be using another. As long as the branches "match", Avro would resolve them correctly.

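For concreteness, here is a sketch of what a pair of "matching" schemas could look like. The record name `User` and its fields are made up for illustration; they are not from any real application. The old schema:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
```

and the new schema, which adds one field with a default:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Under the spec's resolution rules, data written with the old schema can be read with the new one because the missing `email` field takes its default, and data written with the new schema can be read with the old one because the reader ignores the writer's extra field. So this pair should match in both directions, which is the property needed for going back to the old version.
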
The current Java generic writer can correctly pick the branch as long as the object's schema is one of the branches. The nice thing about this improved design is that there is no need to store a separate schema "pointer" along with the object. The "union-index" essentially acts as the pointer, and it is internal to Avro.

But there is one problem. As per the Avro specification, in order to "match", two schemas of the same type must have the same name. But two schemas with the same type and name cannot both be branches of a union. Thus the design above will not work.

If we modify the spec as follows, it would work:

1. Add a new optional string attribute called "revision" to all named schemas (record, enum, fixed).
2. Allow a union to have branches of the same type and name, provided they all have revisions and the revisions are different. (Not having a revision attribute could be treated as having a null revision, but I'd rather be less permissive here.)
3. Schemas match as per the current matching rules, even if the revisions do not match.
4. While writing, the implementation should choose the branch that matches the type, name and revision.

Caveats:

1. Though we can avoid storing the "pointer" to the schema with each object, the application must still somehow figure out the type of the object so as to associate it with the right union schema. We do not allow unions of unions. The application could "flatten" the unions for all the object types into one. It's not pretty. The application need not resort to this if it can somehow associate the object type with the object (e.g. from the location of the serialized binary).
2. I don't know the implications of this spec change for implementations in languages other than Java.
3. Implementations that do not support revision should ignore it and continue to work the way they do today. But I'm not sure what the current ones do when they encounter an attribute they don't understand.

What do you think?

Regards,
Thiru
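
P.S. To illustrate the proposal, here is a rough sketch of the kind of union schema it would allow. The record name `User`, its fields and the revision values "1" and "2" are hypothetical, and "revision" is the proposed attribute, not something in the current spec. Under the current spec this union is rejected because both branches are records named `User`.

```json
[
  {
    "type": "record",
    "name": "User",
    "revision": "1",
    "fields": [
      {"name": "id", "type": "long"},
      {"name": "name", "type": "string"}
    ]
  },
  {
    "type": "record",
    "name": "User",
    "revision": "2",
    "fields": [
      {"name": "id", "type": "long"},
      {"name": "name", "type": "string"},
      {"name": "email", "type": ["null", "string"], "default": null}
    ]
  }
]
```

With the rules above, a writer holding a revision "2" object would pick the second branch, so the serialized union index would be 1. A reader that expects revision "1" would still resolve the data, because the two branches match by type and name regardless of revision.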
