Jeff Hammerbacher wrote:
* Would it be useful to express the metadata block format in the specification as an Avro schema? The handshake is specified with a schema, and it might help folks grasp what's going on a bit sooner.
Yes, this probably makes sense. Note that substantial changes to the file format are currently being discussed in:
https://issues.apache.org/jira/browse/AVRO-160 Perhaps this would fit in as a part of that.
* There could be much better in-code documentation for language implementations--e.g. we could put parts of the specification into the docstrings for Python. Are folks trying to keep the code parsimonious by putting the documentation on the wiki or site? I'd rather see this documentation in the code as well, but I'd be happy to respect the desires of the community.
I also generally prefer documentation for code to live with the code, with user-readable versions generated by tools like javadoc, pydoc, dox, etc. We're already doing this. Specifications and tutorials often don't fit in this way, but most reference-type documentation should ideally be generated from the code. But I don't see a lot of reference documentation on the wiki or website.
So it seems like your complaint is not that the documentation is in the wrong place, but simply that it's insufficient. Is that right? If so, patches welcome! And code reviewers should probably be more diligent about generating and reading the documentation before they accept patches.
* The header format for file object containers is specified as "Four bytes, ASCII 'O', 'b', 'j', followed by zero.", but both Python and Java implementations use a "VERSION" constant for the last byte of the magic constant. Should we make this explicit in the specification?
The expectation is that someday that zero might change to a 1. When that happens, the spec will be updated and code should be as well. The code might support multiple versions, or only a single version. I don't see a point in mandating how implementations handle this.
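As an illustration of one way an implementation might handle this, here is a minimal sketch in Python. The names MAGIC_PREFIX, SUPPORTED_VERSIONS and read_container_version are hypothetical and not taken from the actual Python or Java code; the only thing drawn from the spec is the four-byte header, 'O', 'b', 'j' followed by a version byte that is currently zero.

```python
# Hypothetical sketch: check the container file magic and version byte.
MAGIC_PREFIX = b"Obj"       # ASCII 'O', 'b', 'j' as given in the spec
SUPPORTED_VERSIONS = {0}    # assumption: this reader only understands version 0


def read_container_version(header: bytes) -> int:
    """Return the version byte of a container file header, or raise ValueError."""
    if len(header) < 4 or header[:3] != MAGIC_PREFIX:
        raise ValueError("not an Avro object container file")
    version = header[3]
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported container version: {version}")
    return version
```

An implementation that wanted to support several versions would add them to the set and dispatch on the returned value; one that supports a single version can simply reject everything else, as above.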
* It would be nice to have some guidance about how you expect the specification to be implemented: e.g. how are you using the packages "specific", "generic", and "reflect" in the Java implementation?
I expect implementations to differ, as appropriate to their programming languages. To some degree, "specific", "generic" and "reflect" are experiments. There was another API, "event", that was dropped as a failed experiment.
I hope that languages more dynamic than Java will collapse "generic" and "specific"-style APIs, synthesizing new first-class data structures from the schema on the fly, and might not need a "reflect"-like API at all. My primary motivation for the "reflect" API is to see if I can move Hadoop onto Avro RPC without rewriting too much of Hadoop. I don't know whether this will work, or whether it's a good approach for other applications and languages.
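To give a rough idea of what synthesizing a data structure from the schema on the fly could look like in a dynamic language, here is a small Python sketch. It is only an illustration under my own assumptions: record_class is a hypothetical helper, it handles only flat record schemas already parsed into a dict, and it ignores field types entirely.

```python
from collections import namedtuple


def record_class(schema: dict):
    """Build a namedtuple class from a record schema given as a parsed dict."""
    if schema.get("type") != "record":
        raise ValueError("only record schemas are handled in this sketch")
    field_names = [field["name"] for field in schema["fields"]]
    return namedtuple(schema["name"], field_names)


# Example schema, made up for illustration.
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

User = record_class(user_schema)    # class created at runtime, no code generation step
u = User(name="ada", age=36)
```

Here the same class serves both roles: it is built generically from any schema, yet the application uses it like a schema-specific type with named fields.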
An event-like API makes sense for some applications. Such an API is not object-based: instead of the framework traversing the data, the application traverses it and calls the framework. In Java, the Encoder/Decoder APIs now also double as an event-based API. This is only safe when a ValidatingEncoder is used, and a ResolvingDecoder should be used to handle object versioning. These too are, to some degree, experiments. They compile the schema to an LL(1) grammar and "parse" the event stream to keep track of where things are in the schema. It's simpler to implement an event API that's not parser-based, but it's simpler to use one that is. We'll see whether folks implement parser-based event APIs for other languages.
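For anyone who wants a feel for the event style, here is a deliberately simplified Python sketch. It is not the Java implementation: EventEncoder and ValidatingEventEncoder are hypothetical names, and instead of an LL(1) grammar the validator just walks a flat list of expected field types, which only works for a record with no nesting. The long and string encodings (zigzag varint, and a long length followed by UTF-8 bytes) follow the Avro binary encoding.

```python
import io


def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as Avro does: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


class EventEncoder:
    """The application traverses its own data and calls write_* for each value."""

    def __init__(self, out):
        self.out = out

    def _put_long(self, n):
        self.out.write(zigzag_varint(n))

    def write_long(self, n):
        self._put_long(n)

    def write_string(self, s):
        data = s.encode("utf-8")
        self._put_long(len(data))   # length prefix, not a schema-level event
        self.out.write(data)


class ValidatingEventEncoder(EventEncoder):
    """Rejects calls that do not match the expected sequence of field types."""

    def __init__(self, out, field_types):
        super().__init__(out)
        self.expected = list(field_types)
        self.pos = 0

    def _expect(self, kind):
        if self.pos >= len(self.expected) or self.expected[self.pos] != kind:
            raise TypeError(f"unexpected {kind} at position {self.pos}")
        self.pos += 1

    def write_long(self, n):
        self._expect("long")
        super().write_long(n)

    def write_string(self, s):
        self._expect("string")
        super().write_string(s)


buf = io.BytesIO()
enc = ValidatingEventEncoder(buf, ["string", "long"])
enc.write_string("hi")
enc.write_long(64)
```

Without the validating wrapper nothing stops the application from writing values in the wrong order and producing bytes that no reader of the schema can decode, which is the safety concern mentioned above.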
Are these the sort of implementation notes you'd like to see in the spec? If so, it's hard for me to make firm recommendations until we've got more production uses. Once approaches are validated by use, then perhaps they merit description in the spec. At least that's my instinct. If others prefer, they might propose adding some implementation notes sooner.
* The Python implementation uses, e.g., "getmeta" rather than "get_meta" or "getMeta". Following PEP-8 or Google's Python style guidelines (http://code.google.com/p/soc/wiki/PythonStyleGuide) or "Code like a Pythonista" (http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html) would be a good idea, but the patch to refactor the code would be fairly large. Is it worth taking on this code hygiene issue sooner rather than later?
I think each API should use the best practices of its language. It's convenient when different languages name similar things similarly. So if for example C ends up with APIs that are similar to Java's specific and generic, then it would reduce confusion if these were called specific and generic too.
Doug
