I have been involved in a big Avro prototyping phase recently (closing in on implementation rapidly). I'm sure I'll be contributing to its improvement myself down the road, but the first task is to use it.
My current largest *.avsc file is 3800 bytes and growing. I'm primarily using the Specific API on trunk (looking forward to 1.3!), and looking at writing quick-and-dirty wrappers for Pig, Cascading, and M/R to read/write just these static types until general solutions are available or we have time to make one. Various topics are on my mind.

* Tutorial

The tutorial only works with protocols. I want schemas and probably won't touch protocols for a long time. I had to make my own Maven plugin that handles both; maybe there should be an official avro-maven-plugin. Building out a broader tutorial would help developers get up to speed a lot faster, and more complicated sample schemas in the test suite would be useful too.

* Naming, JSON schema definition, and schema migration

A big concern of mine is long-term schema migration, and how that relates to how I design my schemas. Unions quickly become necessary, but their restrictions as discussed in https://issues.apache.org/jira/browse/AVRO-248 are both good and bad. I ran into one case where I wanted:

{"name": "ipAddr", "type": [
    {"name": "IPv4", "type": "fixed", "size": 4},
    {"name": "IPv6", "type": "fixed", "size": 16}
]}

This won't work, even though these have their own names and generate their own classes. I suppose it is trickier outside of the specific compiler and Java. I could make the union one of records that wrap these two, but then the in-memory representation of this data would be inefficient. I settled on a variable-length "bytes", as it is just as efficient serialized and more efficient in memory; wrapper classes will have to enforce the sizes. Writing wrapper classes around unions right now is annoying -- there's no way I'm letting client code cast stuff, so I'm using inheritance, factories, and generics to get something off the ground that encapsulates the usage of Avro and doesn't expose serialization details.
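For illustration, the two workarounds look roughly like this (the wrapper record and field names here are hypothetical, just to sketch the shape):

```json
{"name": "ipAddr", "type": [
    {"type": "record", "name": "IPv4Wrapper", "fields": [
        {"name": "addr", "type": {"type": "fixed", "name": "IPv4", "size": 4}}
    ]},
    {"type": "record", "name": "IPv6Wrapper", "fields": [
        {"name": "addr", "type": {"type": "fixed", "name": "IPv6", "size": 16}}
    ]}
]}
```

versus the version I settled on, where the size check lives in hand-written wrapper code instead of the schema:

```json
{"name": "ipAddr", "type": "bytes"}
```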
I was also confused at first about naming, and how much of it seems unnecessary from a "just look at the text" view. For example:

{"name": "Thing", "type": "record", "fields": [
    {"name": "foo", "type": "string"},
    {"name": "bars", "type": "array", "items": "int"}
]}

is invalid, but

{"name": "Thing", "type": "record", "fields": [
    {"name": "foo", "type": "string"},
    {"name": "bars", "type": {"type": "array", "items": "int"}}
]}

is, even though from a human-readable point of view the extra information is redundant ("I told it it was an array of ints, but it demands to be an array of ints!" :D). I understand what it is doing and why, for the most part, but the spec didn't make it clear that some things CAN'T have names, even when they appear where a name is required (fields must have names, arrays can't have names, therefore arrays can't be fields? -- half true). Some examples with counter-examples would be a plus. Also, the error message from the first snippet above is very confusing for someone new: "invalid name 'array'" -- um, I named it "bars"; "array" is the type. I think the first snippet could be shorthand for the second, although any ambiguity is bad for making schema resolution robust. Oddly, adding naming to unions, arrays, etc. has the potential to _reduce_ the verbosity of the JSON: yes, one would "have to" name the array, but one wouldn't be forced to create an anonymous record inside a field, since fields must be named and almost everything in a *.avsc is a field.

All of this, including AVRO-248, makes me a bit concerned about data migration. If I have a data file serialized with a schema where unions are unnamed, and then later upgrade, how am I to resolve those? Should an Avro schema always contain a "version": "1.3" or something similar in the record definition? Like namespaces, it could be assumed that the version propagates down to children unless overridden. I will want future code to be able to read old schemas, or at minimum break very reliably.
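To sketch the versioning idea: "version" is not an attribute Avro defines, and nothing would enforce it today, but if the parser preserves unknown attributes as schema properties, something like this could ride along with the record definition:

```json
{"type": "record", "name": "Thing", "version": "1.3", "fields": [
    {"name": "foo", "type": "string"},
    {"name": "bars", "type": {"type": "array", "items": "int"}}
]}
```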
* In-memory data representation

Avro is very good at reducing serialized size, but doesn't optimize memory footprint. None of this is a big deal for the typical Hadoop use case, but for my use cases, where I want to serialize these things into BDBs or some other key/value store, in-memory footprint is critical. Extra nested object references can easily consume a lot of memory and reduce the effectiveness of in-memory caching for key/value stores. Another time you would want to make sure minimum memory is used is in a map-side join. Although something can be done to trim up the Specific API, I think that an annotation-based approach (with ASM) will be the most flexible and powerful in the long run. For example, a fixed-size object can just be a byte[] with an annotation, so that ASM knows how to decorate the setter/getter to enforce the size, what the Avro properties for the field are, etc. -- rather than having to be its own object that inherits from an abstract fixed type. Much other naming can collapse from objects to methods/annotations this way when generating classes from schemas, and vice versa when generating schemas from classes without otherwise altering them.

* Schema re-use

Schema re-use is a challenge. Since all the types have to be available in the same JSON parse, some things get duplicated. I have duplicate GUID and IpAddr named records inside different *.avsc files, for example. Some built-in way to get re-use will be helpful. I noticed that some unit tests seem to pre-process some includes; that looks a bit clunky. What else is there? It would be useful if the Specific compiler took a set of files and compiled them all, looking across the file set for named items that don't yet exist in the one currently being processed. I suppose I could put them all into one *.avsc in an array, but that won't take long to end up at 50K-plus of text in one file if Avro is used a lot.
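The single-file approach would look something like this, assuming the tools accept a top-level JSON array (a union of named types) and generate a class per named type -- GUID and IpAddr are the shared types mentioned above, and "Server" is a hypothetical consumer of them:

```json
[
    {"type": "fixed", "name": "GUID", "size": 16},
    {"type": "record", "name": "IpAddr", "fields": [
        {"name": "addr", "type": "bytes"}
    ]},
    {"type": "record", "name": "Server", "fields": [
        {"name": "id", "type": "GUID"},
        {"name": "ip", "type": "IpAddr"}
    ]}
]
```

Later types in the array can reference earlier ones by name, which is exactly the cross-file lookup I'd want the Specific compiler to do across a file set.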
* Future format ideas

One thing I noticed was that there is a fairly big temptation in some cases to have a union of one type with null. I have been able to essentially treat an empty string, empty array, empty byte[], or a sentinel value as equivalent to null for most cases, but that brought me to an idea: for many types, Avro serializes the number of entries first. Why not treat a serialized length of -1 as equivalent to null? Then ["null", "string"] serializes to no more space than just "string", and the same goes for arrays, bytes, etc. One problem with that would be that null would always sort to one side of the union types, but it does make null "free" for any type with a size. Obviously, this is a rather incompatible change, not for consideration any time soon -- and it might be a lot of work for one byte.

Another idea is something union-esque. I'm not sure what to call it; maybe a multi-union, maybe a sparse-record. The idea is that you define a bunch of optional fields in a record/union-like container, and any, all, or none of the items may be present. Right now you can simulate this with a bunch of nested unions with nulls (which is horrible to actually use with either API), but if it were intrinsic, then a bitmask or other more compact serialization could be used. For some types of data with a lot of sparse fields, this is helpful. A map isn't great because the keys are big and the types must all be the same; a union has only one branch and restrictions on duplication of types. The in-memory footprint of this can also be optimized in ways you can't achieve by simulating it with other types. This is the only gap I sense when defining schemas:

"It's either one of these, or one of those" >>> does either contain data? union : enum
"It's not always there" >>> union with null, or encode null as a value (-1, "", [], etc.)
"It's got a bunch of fields, but only a few of them are ever set" >>> ???
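To make the -1-length idea concrete: Avro writes lengths as zigzag-encoded varints, so a length of -1 would cost exactly one byte (0x01). A minimal sketch, not real Avro API -- the encode_nullable_bytes helper is hypothetical, but the zigzag/varint arithmetic matches Avro's long encoding:

```python
def zigzag(n):
    """Zigzag-encode a signed 64-bit int, as Avro does for longs."""
    return (n << 1) ^ (n >> 63)

def write_varint(n):
    """Encode a non-negative int as a little-endian base-128 varint."""
    out = bytearray()
    while True:
        lo = n & 0x7F
        n >>= 7
        if n:
            out.append(lo | 0x80)
        else:
            out.append(lo)
            return bytes(out)

def encode_nullable_bytes(value):
    """Hypothetical encoding where length -1 means null, so a
    ["null", "bytes"] union costs no more space than plain "bytes"."""
    if value is None:
        return write_varint(zigzag(-1))  # a single byte: 0x01
    return write_varint(zigzag(len(value))) + value
```

So encode_nullable_bytes(None) is the one byte b"\x01", while encode_nullable_bytes(b"hi") is b"\x04hi" -- the same size a bare "bytes" value would be, with null squeezed into the sign of the length.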
Granted, the third one is usually a sign of some very unstructured data, but it does happen on occasion. I don't have a need for it any time soon.

What about field-groups? Sometimes records are large and one wants to project out only a few fields, but it can take a while to find the 30th field in a complex record. Often, certain columns are frequently accessed together, so many systems have the ability to group them. For a single Avro record, there could be length markers to allow efficient skipping over column groups within records. This would increase the serialized size, but for large records, 3 or 4 markers to break it up could be a huge performance win when projecting out subsets of the data.

* Building and Packaging Java

Published Maven artifacts including source would be useful. It's nice to pull up code in Eclipse, click through on an Avro class, and see the source without having to configure anything -- or for the debugger to chase source code into other packages without manually finding the source. Especially when helping others on their machines (since I have the Avro source :D).

Dependencies need to be documented somewhere fairly visible. What does Avro need for runtime usage of only the Specific API? Generic? What extra does it need to generate source code from schemas/protocols? What extra does it need to generate protocols/schemas from classes via reflection? The full "avroj" jar file is huge -- 4MB -- and most of it is only needed at build time.

I think that is enough for now ... Looking forward to more Avro!