Doug,

There has previously been a fair amount of discussion on the Thrift list (possibly pre-Incubator) about self-describing Thrift streams and the like, from when we talked about providing a superset of RecordIO functionality. Re-open that discussion and I imagine you'll find some interested parties. Writing an interpreter of Thrift type descriptors for any of the scripting languages doesn't seem like it would be that hard; see the sketch below.
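For instance, a toy interpreter in Python might look something like this. The descriptor shape here is invented purely for illustration; it is not Thrift's actual descriptor or wire format:

    import struct

    def read_value(descriptor, stream):
        # Decode one value from a binary stream, driven by a type descriptor.
        kind = descriptor["type"]
        if kind == "i32":
            return struct.unpack(">i", stream.read(4))[0]
        if kind == "string":
            (length,) = struct.unpack(">i", stream.read(4))
            return stream.read(length).decode("utf-8")
        if kind == "struct":
            # Walk the declared fields in order; no generated classes needed.
            return {f["name"]: read_value(f, stream)
                    for f in descriptor["fields"]}
        raise ValueError("unknown type: %s" % kind)

    # Example (hypothetical descriptor and byte layout):
    # desc = {"type": "struct", "fields": [
    #     {"name": "id", "type": "i32"},
    #     {"name": "name", "type": "string"}]}
    # read_value(desc, io.BytesIO(b"\x00\x00\x00\x01\x00\x00\x00\x03ada"))
    # -> {"id": 1, "name": "ada"}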
Bumping up a level: rather than inventing a whole new set of Hadoop-specific RPC and serialization mechanisms, I'd suggest that there would be more leverage in adopting Thrift. Thrift is in the Apache Incubator (as you know ;)) and there is already fairly significant overlap between the two communities. A number of Hadoop-related technologies are already using Thrift in places (HBase, Hive, etc.). If there were more involvement in Thrift from core Hadoop development, I am pretty certain you would get what you wanted out of it pretty quickly.

Chad

-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED] On Behalf Of Doug Cutting
Sent: Friday, October 24, 2008 2:40 PM
To: core-dev@hadoop.apache.org
Subject: Re: Multi-language serialization discussion

Bryan Duxbury wrote:
> I've been reading the discussion about what serialization/RPC project to
> use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I
> thought I'd throw in a pro-Thrift vote.

I've been thinking about this, and here's where I've come to:

It's not just RPC. We need a single, primary object serialization system that's used for RPC and for most file-based application data.

Scripting languages are primary users of Hadoop. We must thus make it easy and natural for scripting languages to process data with Hadoop.

Data should be self-describing. For example, a script should be able to read a file without having to first generate code specific to the records in that file. Similarly, a script should be able to write records without having to externally define their schema.

We need an efficient binary file format. A file of records should not repeat the record names with each record. Rather, the record schema should be stored in the file once. Programs should be able to read the schema and efficiently produce instances from the file.

The schema language should support specification of required and optional fields, so that class definitions may evolve.

For some languages (e.g., Java & C) one may wish to generate native classes to represent a schema, and to read & write instances.

So, how well does Thrift meet these needs? Thrift's IDL is a schema language, and JSON is a self-describing data format. But arbitrary JSON data is not generally readable by any Thrift-based program, and Thrift's binary formats are not self-describing: they do not include the IDL. Nor does the Thrift runtime in each language permit one to read an IDL specification and then use it to efficiently read and write compact, self-describing data.

I wonder if we might instead use JSON schemas to describe data:

http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft

We'd implement, in each language, a codec that, given a schema, can efficiently read and write instances of that schema. (JSON schemas are themselves JSON data, so any language that supports JSON can already read and write a JSON schema.) The writer could either take a provided schema or automatically induce a schema from the records written. Schemas would be stored in data files, with the data (see the sketch below).

JSON's not perfect. It doesn't (yet) support binary data: that would need to be fixed. But I think Thrift's focus on code generation makes it less friendly to scripting languages, which are primary users of Hadoop. Code generation is possible given a schema, and may be useful as an optimization in many cases, but it should be optional, not central. Folks should be able to process any file without external information or external compilers.
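To make that concrete, here is a minimal Python sketch of the "schema stored once, compact records after" idea. The line-oriented layout and the simplified schema shape ({"fields": [...]}) are assumptions for illustration, not a proposed format:

    import json

    def write_records(path, schema, records):
        with open(path, "w") as f:
            f.write(json.dumps(schema) + "\n")      # schema stored once
            names = [field["name"] for field in schema["fields"]]
            for rec in records:
                # Records are JSON arrays: values only, field names are
                # never repeated per record.
                f.write(json.dumps([rec[n] for n in names]) + "\n")

    def read_records(path):
        with open(path) as f:
            # Self-describing: the schema is read from the file itself,
            # so no generated code is needed.
            schema = json.loads(f.readline())
            names = [field["name"] for field in schema["fields"]]
            for line in f:
                yield dict(zip(names, json.loads(line)))

    # Usage:
    # schema = {"fields": [{"name": "id", "type": "int"},
    #                      {"name": "name", "type": "string"}]}
    # write_records("users.dat", schema, [{"id": 1, "name": "ada"}])
    # list(read_records("users.dat")) -> [{"id": 1, "name": "ada"}]

Any language with a JSON library could read and write such a file without generated classes or an external compiler.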
A small runtime codec should be all that needs to be implemented in each language. Even where that's not present, data could be transparently and losslessly converted to and from textual JSON by, e.g., C utility programs, since most languages already have JSON codecs (a sketch of such a conversion follows below).

Does this make any sense?

Doug
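As an illustration of the lossless conversion described above, a minimal sketch, again assuming the line-oriented, schema-headed layout from the earlier sketch rather than any defined format:

    import json

    def to_text_json(compact_path, text_path):
        # Expand a schema-headed compact file into one plain JSON object
        # per line. Nothing is lost, because the schema travels with the
        # data; the reverse direction just re-applies the schema's field
        # order to turn objects back into positional arrays.
        with open(compact_path) as src, open(text_path, "w") as dst:
            schema = json.loads(src.readline())
            names = [f["name"] for f in schema["fields"]]
            for line in src:
                record = dict(zip(names, json.loads(line)))
                dst.write(json.dumps(record) + "\n")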