Jay,

This sounds to me like something of general utility that would make a great addition to Avro.

To be clear, I assume you mean contributing this as source code for a service that folks can deploy, right? For example, it might be a Java project that builds a WAR file that, when deployed, presents a REST front end and talks to a backing persistence layer where the schemas are stored. Is that right?

Also note that Avro recently added a standard facility for defining Schema fingerprints that might be used as Schema IDs in such a service:

http://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas

This has currently been implemented in Java and C#:

http://avro.apache.org/docs/current/api/java/org/apache/avro/SchemaNormalization.html

I like the notion of a Schema source for the uses you describe. For records, might this simply be the fully-qualified record name? For unions and other unnamed types it might take the same form as a record name. We could use the "long form" for primitives and always supply a name, so that a string schema might be {"type":"string", "name":"org.foo.Bar"}. Would this work, or is there some other structure and use of sources for which schema names are not a good match?

Cheers,

Doug

On Tue, Jul 10, 2012 at 10:53 AM, Jay Kreps <[email protected]> wrote:
> I noticed in AVRO-1006 there was a mention of standardizing on some kind of
> schema repository that would maintain a central set of all versions of a
> schema and allow a way to reference schemas by id.
>
> At LinkedIn we have standardized (almost) all of our persistent data on
> Avro and we have a repository like this for managing schemas. Messages are
> stored with the schema in Hadoop, but for systems that store rows
> independently like databases or messaging we instead store a schema id with
> each row/message. We would love for there to be an open source version of
> this to make it possible to open up our other tools
> for compatibility checking, ETL and other things that depend on this service.
>
> The service itself is basically a REST service that maintains schemas. Each
> schema has a "source" that it is associated with (the table or messaging
> topic or whatever) and a unique id. Schemas can be fetched by id or you can
> get the latest schema for a given source. Having the notion of sources
> allows us to do two things: (1) enforce a compatibility model on schema
> changes (no backwards incompatible changes for various definitions of
> backwards compatibility), and (2) allow our Hadoop ETL to project all
> messages forward to the latest schema (since AvroFile requires a single
> schema not a per-row schema).
>
> If the Avro project is interested in adopting an official repository that
> would be really nice. It is frankly a pretty trivial piece of code, but
> standardization would allow interoperability between things. I would be
> willing to either open source our repository implementation or do a
> from-scratch one if we come up with more requirements.
>
> -Jay
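
For concreteness, here is a rough in-memory sketch of the repository described in the quoted message: schemas registered under a source, fetched by id, and checked for compatibility against the latest version for that source. It is not LinkedIn's implementation or an Avro API. The class and function names (SchemaRepository, schema_id, removes_no_fields) are hypothetical, the id is a plain SHA-256 over key-sorted JSON standing in for a real Avro fingerprint, and the compatibility rule is a toy assumption.

```python
import hashlib
import json


class IncompatibleSchemaError(Exception):
    """Raised when a new schema fails the compatibility check for its source."""


def schema_id(schema):
    # Stand-in id: hash of a key-sorted JSON rendering of the schema.
    # This is NOT Avro's Parsing Canonical Form; it only illustrates
    # deriving the id from the schema content.
    text = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def removes_no_fields(old, new):
    # Toy compatibility rule (an assumption): a new record schema may add
    # fields but must keep every field name the previous version had.
    old_names = {f["name"] for f in old.get("fields", [])}
    new_names = {f["name"] for f in new.get("fields", [])}
    return old_names <= new_names


class SchemaRepository:
    def __init__(self, is_compatible=removes_no_fields):
        self._by_id = {}        # schema id -> schema
        self._by_source = {}    # source -> list of schema ids, oldest first
        self._is_compatible = is_compatible

    def register(self, source, schema):
        """Store a schema under a source and return its id."""
        versions = self._by_source.setdefault(source, [])
        if versions:
            latest = self._by_id[versions[-1]]
            if not self._is_compatible(latest, schema):
                raise IncompatibleSchemaError(source)
        sid = schema_id(schema)
        self._by_id[sid] = schema
        if sid not in versions:
            versions.append(sid)
        return sid

    def get(self, sid):
        """Fetch a schema by id."""
        return self._by_id[sid]

    def latest(self, source):
        """Fetch the most recently registered schema for a source."""
        return self._by_id[self._by_source[source][-1]]
```

A row or message would carry only the id returned by register(). A reader could then call get(id) for the writer schema and latest(source) for the reader schema, which is the "project forward to the latest schema" step in the quoted message.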
