Curt Cox wrote:
Let me restate, so you can tell me if I'm wrong: "Writable is used
instead of Serializable because it provides a more compact stream
format and allows easier random access.  The two have different
semantics, but the differences don't have a major impact on versioning."

Serialization's format is also somewhat complex to interoperate with from other programming languages. Hadoop, long-term, would like to provide easy data interoperability. The current attempt at this is the record i/o package:

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/record/package-summary.html

Java's Serialization protocol would complicate things somewhat:

http://java.sun.com/j2se/1.5.0/docs/guide/serialization/spec/protocol.html#10258

This grammar would need to be re-implemented for each language.
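
To put a number on the size difference, here's a quick comparison one can run against the plain JDK: writing a single Integer through ObjectOutputStream emits the stream header plus a full class descriptor, while raw DataOutput emits just the four bytes of the value.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.ObjectOutputStream;

    public class StreamSizes {
      public static void main(String[] args) throws Exception {
        // Serialization: stream magic/version, then a class descriptor
        // for java.lang.Integer (and java.lang.Number), then the value.
        ByteArrayOutputStream serialized = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(serialized);
        oos.writeObject(Integer.valueOf(42));
        oos.close();

        // DataOutput: exactly the four big-endian bytes of the int.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(raw);
        dos.writeInt(42);
        dos.close();

        System.out.println("Serialization: " + serialized.size() + " bytes"); // ~80
        System.out.println("DataOutput:    " + raw.size() + " bytes");        // 4
      }
    }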

In my experience, using Serialization instead of DataInput/DataOutput
streams has a major impact on versioning.  Serialization keeps a lot
of metadata in the stream.  This makes detecting format changes very
easy, but can really complicate backward compatibility.  Also,
serialization is geared toward preserving the connections of an object
graph, which is behind a lot of the differences you mentioned.

That all sounds correct.
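
To make the versioning point concrete, here's a rough sketch of how a Writable typically handles format evolution by hand. The record below is hypothetical, but the two-method Writable interface is Hadoop's real one: nothing goes on the wire except what write() emits, so compatibility decisions are explicit rather than delegated to Serialization's class-descriptor matching.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical record illustrating hand-rolled versioning: new
    // readers accept old data by branching on an explicit version byte.
    class Person implements Writable {
      private static final byte VERSION = 2;

      private String name;
      private int age;                        // field added in version 2

      public void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeUTF(name);
        out.writeInt(age);
      }

      public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        name = in.readUTF();
        age = (version >= 2) ? in.readInt() : -1;  // default for old data
      }
    }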

You didn't address the interoperability advantage of using standard
Java classes instead of Writables.  As I mentioned, while using
serialization would provide this benefit, serialization isn't
necessary to get it.  You could provide a mechanism for writers to be
registered for classes.  So, instead of IntWritable, users could just
use a normal Integer.  The stream would be byte-for-byte identical to
what it is now, but users could work with standard types.

You're right, that is an advantage of Serialization. Each class has a default (if big & slow) serialized form. The record package (linked above) is Hadoop's current plan to provide more efficient automatic serialization and language interoperability at the same time.
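
As a sketch of the registry idea (entirely hypothetical; nothing like this exists in Hadoop today), it would amount to a table mapping standard classes to hand-written codecs, so the wire format stays exactly what it is now:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical registry: user code passes plain Integers around,
    // but the bytes written are still chosen by a hand-written codec,
    // identical to what IntWritable produces today.
    final class WriterRegistry {
      interface Codec<T> {
        void write(T value, DataOutput out) throws IOException;
        T read(DataInput in) throws IOException;
      }

      private static final Map<Class<?>, Codec<?>> CODECS =
          new HashMap<Class<?>, Codec<?>>();

      static <T> void register(Class<T> type, Codec<T> codec) {
        CODECS.put(type, codec);
      }

      @SuppressWarnings("unchecked")
      static <T> Codec<T> lookup(Class<T> type) {
        return (Codec<T>) CODECS.get(type);
      }

      static {
        register(Integer.class, new Codec<Integer>() {
          public void write(Integer v, DataOutput out) throws IOException {
            out.writeInt(v.intValue());  // same four bytes as IntWritable
          }
          public Integer read(DataInput in) throws IOException {
            return Integer.valueOf(in.readInt());
          }
        });
      }
    }

A file format or RPC layer would then look up the codec by runtime class instead of requiring a Writable wrapper.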

(A side note: Integer isn't the best example. In Hadoop's RPC we use ObjectWritable, so RPC protocols can already pass and return some primitive types like int, long and float. Since these types are not classes, IntWritable isn't much more awkward than Integer: values must still be explicitly wrapped.)

But we could also extend ObjectWritable to use introspection and automatically serialize arbitrary objects.
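
A rough sketch of what that extension might look like (hypothetical, and not what ObjectWritable does today): walk the declared fields reflectively and write each through DataOutput, with the field list itself acting as the implicit format.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;

    // Hypothetical introspection-based writer: serialize an object's
    // fields without the class implementing Writable. A real version
    // would handle more types, nulls, inheritance, and recursion.
    class ReflectiveWriter {
      static void write(Object obj, DataOutput out) throws IOException {
        try {
          for (Field f : obj.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) {
              continue;                      // skip class-level fields
            }
            f.setAccessible(true);
            Class<?> t = f.getType();
            if (t == int.class) {
              out.writeInt(f.getInt(obj));
            } else if (t == long.class) {
              out.writeLong(f.getLong(obj));
            } else if (t == String.class) {
              out.writeUTF((String) f.get(obj));
            } else {
              throw new IOException("unsupported field type: " + t);
            }
          }
        } catch (IllegalAccessException e) {
          throw new IOException(e.toString());
        }
      }
    }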

----

Let me try to answer the original question again: Why didn't I use Serialization when we first started Hadoop? Because it looked big-and-hairy and I thought we needed something lean-and-mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.

The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we'd need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.
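
For example, with a plain socket an RPC layer can pin down exactly those knobs. This is a generic sketch with a made-up host and port, not Hadoop's actual ipc code:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Generic sketch of the connection-level details a hand-rolled
    // RPC layer can control, which RMI largely hides.
    public class RpcConnection {
      public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        socket.setTcpNoDelay(true);                  // no Nagle batching
        socket.setSoTimeout(60 * 1000);              // read timeout, ms
        socket.connect(new InetSocketAddress("datanode1", 9000),
                       20 * 1000);                   // connect timeout, ms

        // Explicit buffer size in front of the stream.
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(socket.getOutputStream(), 64 * 1024));
        out.writeInt(42);                            // exact wire bytes
        out.flush();
        socket.close();
      }
    }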

A quick search turns up the following on the mail archives:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200508.mbox/[EMAIL PROTECTED]
http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200602.mbox/[EMAIL PROTECTED]

The latter ends with:

In summary, I don't think I'd reject a patch that makes this change, but I also would not personally wish to spend a lot of effort implementing it, since I don't see a huge value.

That remains my opinion. One could probably migrate Hadoop to use Serializable and/or Externalizable, possibly without a huge performance impact, and the system might even be easier to use. If someone achieved that, I'd say, "Bravo!" and commit the patch.

Doug
