Re: Some thoughts after playing with Avro for a week or so

Doug Cutting Fri, 23 Oct 2009 11:16:40 -0700

Jeff Hammerbacher wrote:

* Would it be useful to express the metadata block format in the
specification as an Avro schema? The handshake is specified with a schema,
and it might help folks grasp what's going on a bit sooner.

Yes, this probably makes sense. Note that substantial changes to the
file format are currently being discussed in:


https://issues.apache.org/jira/browse/AVRO-160

Perhaps this would fit in as a part of that.

* There could be much better in-code documentation for language
implementations--e.g. we could put parts of the specification into the
docstrings for Python. Are folks trying to keep the code parsimonious by
putting the documentation on the wiki or site? I'd rather see this
documentation in the code as well, but I'd be happy to respect the desires
of the community.

I also generally prefer documentation for code to live with the code,
with user-readable versions generated by tools like javadoc, pydoc, dox,
etc. We're already doing this. Specifications and tutorials often
don't fit in this way, but most reference-type documentation should
ideally be generated from the code. But I don't see a lot of reference
documentation on the wiki or website.

So it seems like perhaps your complaint is not that the
documentation is in the wrong place, but simply that it's insufficient.
Is that right? If so, patches welcome! And probably code reviewers
should be more diligent about generating and reading the documentation
before they accept patches.

* The header format for file object containers is specified as "Four bytes,
ASCII 'O', 'b', 'j', followed by zero.", but both Python and Java
implementations use a "VERSION" constant for the last byte of the magic
constant. Should we make this explicit in the specification?

The expectation is that someday that zero might change to a 1. When
that happens, the spec will be updated and code should be as well. The
code might support multiple versions, or only a single version. I don't
see a point in mandating how implementations handle this.
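
As a rough illustration of one way an implementation might handle this,
here is a minimal Python sketch that treats the fourth byte of the magic
as a version number. The names MAGIC_PREFIX, SUPPORTED_VERSIONS and
read_container_version are hypothetical and are not taken from the
Python or Java implementations; the set of supported versions is an
assumption made for the example.

```python
MAGIC_PREFIX = b"Obj"      # the three ASCII bytes named in the spec
SUPPORTED_VERSIONS = {0}   # assumption: this reader only understands version 0


def read_container_version(header: bytes) -> int:
    """Return the version byte of a 4-byte container magic, or raise ValueError."""
    if len(header) < 4:
        raise ValueError("header too short to contain the magic")
    if header[:3] != MAGIC_PREFIX:
        raise ValueError("not an object container file")
    version = header[3]  # indexing bytes yields an int in Python 3
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported container version: {version}")
    return version
```

A reader that wanted to support several versions could add them to the
set and branch on the returned value; one that supports a single version
simply rejects everything else.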

* It would be nice to have some guidance about how you expect the
specification to be implemented: e.g. how are you using the packages
"specific", "generic", and "reflect" in the Java implementation?

I expect implementations to differ, as appropriate to their programming
languages. To some degree, "specific", "generic" and "reflect" are
experiments. There was another API, "event", that was dropped as a
failed experiment.

I hope that languages more dynamic than Java will collapse "generic" and
"specific"-style APIs, synthesizing new first-class data structures from
the schema on the fly, and might not need a "reflect"-like API at all.
My primary motivation for the "reflect" API is to see if I can move
Hadoop onto Avro RPC without rewriting too much of Hadoop. I don't know
whether this will work, or whether it's a good approach for other
applications and languages.

An event-like API makes sense for some applications. Such an API is not
object-based: instead of the framework traversing the data, the
application traverses it and calls the framework. In Java, the
Encoder/Decoder API also now doubles as an event-based API. This is only
safe when a ValidatingEncoder is used, and a ResolvingDecoder should be
used to handle object versioning. These too are, to some degree,
experiments. They compile the schema to an LL(1) grammar and "parse"
the event stream to keep track of where things are in the schema. It's
simpler to implement an event API that's not parser-based, but it's
simpler to use one that is. We'll see whether folks implement
parser-based event APIs for other languages.
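
To make the idea of validating an event stream against a schema a bit
more concrete, here is a toy Python sketch. It is not how the Java
ValidatingEncoder works: rather than compiling the schema to an LL(1)
grammar, it assumes a record whose fields are primitive type names or
nested records (no unions, arrays, maps or recursion), so the expected
sequence of events can simply be flattened up front. The class
ToyValidatingEncoder and its methods are hypothetical, and it records
events rather than writing any bytes.

```python
class ToyValidatingEncoder:
    """Checks that write_* calls arrive in the order the schema demands."""

    def __init__(self, schema):
        # Flatten the schema into the ordered list of primitive events expected.
        self._expected = list(self._flatten(schema))
        self._pos = 0
        self.events = []  # accepted (kind, value) pairs, in order

    def _flatten(self, schema):
        if isinstance(schema, dict) and schema.get("type") == "record":
            for field in schema["fields"]:
                yield from self._flatten(field["type"])
        else:
            yield schema  # assumed to be a primitive name such as "int"

    def _advance(self, kind, value):
        if self._pos >= len(self._expected):
            raise ValueError(f"unexpected {kind}: record is already complete")
        wanted = self._expected[self._pos]
        if wanted != kind:
            raise ValueError(f"expected {wanted}, got {kind}")
        self._pos += 1
        self.events.append((kind, value))

    def write_int(self, value):
        self._advance("int", value)

    def write_string(self, value):
        self._advance("string", value)

    def is_complete(self):
        return self._pos == len(self._expected)


# Usage: the application traverses its own data and calls the encoder.
user_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}
encoder = ToyValidatingEncoder(user_schema)
encoder.write_int(7)
encoder.write_string("jeff")
```

Calling write_string before write_int on a fresh encoder raises
ValueError, which is the kind of mistake an unvalidated event API would
let through silently.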

Are these the sort of implementation notes you'd like to see in the
spec? If so, it's hard for me to make firm recommendations until we've
got more production uses. Once approaches are validated by use, then
perhaps they merit description in the spec. At least that's my
instinct. If others prefer, they might propose adding some
implementation notes sooner.

* The Python implementation uses, e.g., "getmeta" rather than "get_meta" or
"getMeta". Following PEP-8 or Google's Python style guidelines (
http://code.google.com/p/soc/wiki/PythonStyleGuide) or "Code like a
Pythonista" (
http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html)
would be a good idea, but the patch to refactor the code would be fairly
large. Is it worth taking on this code hygiene issue sooner rather than
later?

I think each API should use the best practices of its language. It's
convenient when different languages name similar things similarly. So
if for example C ends up with APIs that are similar to Java's specific
and generic, then it would reduce confusion if these were called
specific and generic too.


Doug
