On Oct 24, 2008, at 2:39 PM, Doug Cutting wrote:
Bryan Duxbury wrote:
> I've been reading the discussion about what serialization/RPC project
> to use on http://wiki.apache.org/hadoop/Release1.0Requirements, and I
> thought I'd throw in a pro-Thrift vote.
I've been thinking about this, and here's where I've come to:
It's not just RPC. We need a single, primary object serialization
system that's used for RPC and for most file-based application data.
Scripting languages are primary users of Hadoop. We must thus make it
easy and natural for scripting languages to process data with Hadoop.
Data should be self-describing. For example, a script should be able
to read a file without having to first generate code specific to the
records in that file. Similarly, a script should be able to write
records without having to externally define their schema.
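As a rough sketch of what that might look like from a scripting
language, here in Python with a purely hypothetical file layout (a
JSON schema on the first line, one JSON record per line after that;
the field names are made up for illustration):

    import json

    # Hypothetical layout: first line is a JSON schema, each later line one record.
    with open("events.data") as f:
        schema = json.loads(f.readline())   # the schema travels with the data
        for line in f:
            record = json.loads(line)       # plain dict; no generated classes needed
            print(record["user"], record.get("count", 0))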
I like self-describing data for the reasons you have stated.
Q. I assume that in many cases the reader of some serialized data is
expecting a particular data-definition (or versions of it). In this
case the reader has the expected data-definition that was generated
from the IDL. If the two data-definitions (the one from the IDL and
the other from the serialized data) do not match (modulo versions),
then is an exception thrown?
sanjay
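One way such a check might be expressed, purely as an illustration
(the schema representation and the resolution rule here are
assumptions, not anything Thrift or this proposal defines):

    def check_compatible(expected, actual):
        # Illustrative rule: every non-optional field the reader expects must be
        # present in the writer's schema; extra writer fields are simply ignored.
        wanted = {f["name"] for f in expected["fields"] if not f.get("optional")}
        present = {f["name"] for f in actual["fields"]}
        missing = wanted - present
        if missing:
            raise ValueError("schema mismatch; missing fields: %s" % sorted(missing))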
We need an efficient binary file format. A file of records should not
repeat the record names with each record. Rather, the record schema
used should be stored in the file once. Programs should be able to
read the schema and efficiently produce instances from the file.
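A minimal sketch of such a container, in Python, assuming a made-up
layout (a length-prefixed header holding the schema, then
length-prefixed records carrying values only):

    import json, struct

    schema = {"name": "Page",
              "fields": [{"name": "url", "type": "string"},
                         {"name": "hits", "type": "int"}]}
    records = [{"url": "/index", "hits": 42}, {"url": "/about", "hits": 7}]

    # The schema is written once, in the header; each record then carries
    # only its values, so field names are never repeated per record.
    with open("pages.data", "wb") as f:
        header = json.dumps(schema).encode("utf-8")
        f.write(struct.pack(">I", len(header)) + header)
        for r in records:
            body = json.dumps([r[field["name"]] for field in schema["fields"]]).encode("utf-8")
            f.write(struct.pack(">I", len(body)) + body)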
The schema language should support specification of required and
optional fields, so that class definitions may evolve.
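For example (hypothetical schema notation), a later version of a
record might add an optional field with a default, so that data
written against the older definition still reads cleanly:

    # "referrer" was added after the fact; marking it optional with a default
    # lets newer readers handle records written before the field existed.
    page_v2 = {
        "name": "Page",
        "fields": [
            {"name": "url",      "type": "string"},
            {"name": "hits",     "type": "int"},
            {"name": "referrer", "type": "string", "optional": True, "default": ""},
        ],
    }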
For some languages (e.g., Java & C) one may wish to generate native
classes to represent a schema, and to read & write instances.
So, how well does Thrift meet these needs? Thrift's IDL is a schema
language, and JSON is a self-describing data format. But arbitrary
JSON data is not generally readable by any Thrift-based program. And
Thrift's binary formats are not self-describing: they do not include
the IDL. Nor does the Thrift runtime in each language permit one to
read an IDL specification and then use it to efficiently read and
write compact, self-describing data.
I wonder if we might instead use JSON schemas to describe data.
http://groups.google.com/group/json-schema/web/json-schema-proposal---second-draft
We'd implement, in each language, a codec that, given a schema, can
efficiently read and write instances of that schema. (JSON schemas are
JSON data, so any language that supports JSON can already read and
write a JSON schema.) The writer could either take a provided schema,
or automatically induce a schema from the records written. Schemas
would be stored in data files, with the data.
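A toy version of that induction step, in Python (the type names and
the schema shape are assumptions, for illustration only):

    def induce_schema(record, name="Record"):
        # Map each field's Python type to a made-up schema type name.
        type_names = {str: "string", int: "int", float: "double", bool: "boolean"}
        return {"name": name,
                "fields": [{"name": k, "type": type_names.get(type(v), "string")}
                           for k, v in record.items()]}

    print(induce_schema({"url": "/index", "hits": 42}))
    # {'name': 'Record', 'fields': [{'name': 'url', 'type': 'string'},
    #                               {'name': 'hits', 'type': 'int'}]}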
JSON's not perfect. It doesn't (yet) support binary data: that would
need to be fixed. But I think Thrift's focus on code-generation makes
it less friendly to scripting languages, which are primary users of
Hadoop. Code generation is possible given a schema, and may be useful
as an optimization in many cases, but it should be optional, not
central.
Folks should be able to process any file without external information
or external compilers. A small runtime codec is all that should be
implemented in each language. Even if that's not present, data could
be transparently and losslessly converted to and from textual JSON by,
e.g., C utility programs, since most languages already have JSON
codecs.
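For instance, a small dump utility over the hypothetical container
sketched above could render every record as a line of textual JSON,
with no generated code involved:

    import json, struct, sys

    def dump_as_json(path, out=sys.stdout):
        # Read the schema from the header, then print each record as plain JSON.
        with open(path, "rb") as f:
            hlen = struct.unpack(">I", f.read(4))[0]
            schema = json.loads(f.read(hlen))
            names = [field["name"] for field in schema["fields"]]
            while True:
                prefix = f.read(4)
                if not prefix:
                    break
                rlen = struct.unpack(">I", prefix)[0]
                values = json.loads(f.read(rlen))
                out.write(json.dumps(dict(zip(names, values))) + "\n")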
Does this make any sense?
Doug