Curt Cox wrote:
Let me restate, so you can tell me if I'm wrong: "Writable is used
instead of Serializable because it provides a more compact stream
format and allows easier random access.  The two have different
semantics, but the differences don't have a major impact on versioning."

Serialization's format is also somewhat complex to interoperate with from other programming languages. Hadoop, long-term, would like to provide easy data interoperability. The current attempt at this is the record i/o package:

http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/record/package-summary.html

Java's Serialization protocol would complicate things somewhat:

http://java.sun.com/j2se/1.5.0/docs/guide/serialization/spec/protocol.html#10258

This grammar would need to be re-implemented for each language.
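
To put a number on the size difference, here's a quick comparison one can run against the plain JDK: writing a single Integer through ObjectOutputStream emits the stream header plus a full class descriptor, while raw DataOutput emits just the four bytes of the value.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.ObjectOutputStream;

    public class StreamSizes {
      public static void main(String[] args) throws Exception {
        // Serialization: stream magic/version, then a class descriptor
        // for java.lang.Integer (and java.lang.Number), then the value.
        ByteArrayOutputStream serialized = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(serialized);
        oos.writeObject(Integer.valueOf(42));
        oos.close();

        // DataOutput: exactly the four big-endian bytes of the int.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(raw);
        dos.writeInt(42);
        dos.close();

        System.out.println("Serialization: " + serialized.size() + " bytes"); // ~80
        System.out.println("DataOutput:    " + raw.size() + " bytes");        // 4
      }
    }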

In my experience, using Serialization instead of DataInput/DataOutput
streams has a major impact on versioning.  Serialization keeps a lot
of metadata in the stream.  This makes detecting format changes very
easy, but can really complicate backward compatibility.  Also,
serialization is geared toward preserving the connections of an object
graph, which is behind a lot of the differences you mentioned.

That all sounds correct.
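
To make the versioning point concrete, here's a rough sketch of how a Writable typically handles format evolution by hand. The record below is hypothetical, but the two-method Writable interface is Hadoop's real one: nothing goes on the wire except what write() emits, so compatibility decisions are explicit rather than delegated to Serialization's class-descriptor matching.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical record illustrating hand-rolled versioning: new
    // readers accept old data by branching on an explicit version byte.
    class Person implements Writable {
      private static final byte VERSION = 2;

      private String name;
      private int age;                        // field added in version 2

      public void write(DataOutput out) throws IOException {
        out.writeByte(VERSION);
        out.writeUTF(name);
        out.writeInt(age);
      }

      public void readFields(DataInput in) throws IOException {
        byte version = in.readByte();
        name = in.readUTF();
        age = (version >= 2) ? in.readInt() : -1;  // default for old data
      }
    }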

You didn't address the interoperability advantage of using standard
Java classes instead of Writables.  As I mentioned, while using
serialization would provide this benefit, serialization isn't
necessary to get it.  You could provide a mechanism for writers to be
registered for classes.  So, instead of IntWritable, users could just
use a normal Integer.  The stream would be byte-for-byte identical to
what it is now, but users could work with standard types.

You're right, that is an advantage of Serialization. Each class has a default (if big & slow) serialized form. The record package (linked above) is Hadoop's current plan to provide more efficient automatic serialization and language interoperability at the same time.
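
As a sketch of the registry idea (entirely hypothetical; nothing like this exists in Hadoop today), it would amount to a table mapping standard classes to hand-written codecs, so the wire format stays exactly what it is now:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical registry: user code passes plain Integers around,
    // but the bytes written are still chosen by a hand-written codec,
    // identical to what IntWritable produces today.
    final class WriterRegistry {
      interface Codec<T> {
        void write(T value, DataOutput out) throws IOException;
        T read(DataInput in) throws IOException;
      }

      private static final Map<Class<?>, Codec<?>> CODECS =
          new HashMap<Class<?>, Codec<?>>();

      static <T> void register(Class<T> type, Codec<T> codec) {
        CODECS.put(type, codec);
      }

      @SuppressWarnings("unchecked")
      static <T> Codec<T> lookup(Class<T> type) {
        return (Codec<T>) CODECS.get(type);
      }

      static {
        register(Integer.class, new Codec<Integer>() {
          public void write(Integer v, DataOutput out) throws IOException {
            out.writeInt(v.intValue());  // same four bytes as IntWritable
          }
          public Integer read(DataInput in) throws IOException {
            return Integer.valueOf(in.readInt());
          }
        });
      }
    }

A file format or RPC layer would then look up the codec by runtime class instead of requiring a Writable wrapper.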

(A side note: Integer isn't the best example. In Hadoop's RPC we use ObjectWritable, so RPC protocols can already pass and return some primitive types like int, long and float. Since these types are not classes, IntWritable isn't much more awkward than Integer: values must still be explicitly wrapped.)

But we could also extend ObjectWritable to use introspection and automatically serialize arbitrary objects.
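
A rough sketch of what that extension might look like (hypothetical, and not what ObjectWritable does today): walk the declared fields reflectively and write each through DataOutput, with the field list itself acting as the implicit format.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;

    // Hypothetical introspection-based writer: serialize an object's
    // fields without the class implementing Writable. A real version
    // would handle more types, nulls, inheritance, and recursion.
    class ReflectiveWriter {
      static void write(Object obj, DataOutput out) throws IOException {
        try {
          for (Field f : obj.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(f.getModifiers())) {
              continue;                      // skip class-level fields
            }
            f.setAccessible(true);
            Class<?> t = f.getType();
            if (t == int.class) {
              out.writeInt(f.getInt(obj));
            } else if (t == long.class) {
              out.writeLong(f.getLong(obj));
            } else if (t == String.class) {
              out.writeUTF((String) f.get(obj));
            } else {
              throw new IOException("unsupported field type: " + t);
            }
          }
        } catch (IllegalAccessException e) {
          throw new IOException(e.toString());
        }
      }
    }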

----

Let me try to answer the original question again: Why didn't I use Serialization when we first started Hadoop? Because it looked big-and-hairy and I thought we needed something lean-and-mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.

The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we'd need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.
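
For example, with a plain socket an RPC layer can pin down exactly those knobs. This is a generic sketch with a made-up host and port, not Hadoop's actual ipc code:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Generic sketch of the connection-level details a hand-rolled
    // RPC layer can control, which RMI largely hides.
    public class RpcConnection {
      public static void main(String[] args) throws Exception {
        Socket socket = new Socket();
        socket.setTcpNoDelay(true);                  // no Nagle batching
        socket.setSoTimeout(60 * 1000);              // read timeout, ms
        socket.connect(new InetSocketAddress("datanode1", 9000),
                       20 * 1000);                   // connect timeout, ms

        // Explicit buffer size in front of the stream.
        DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(socket.getOutputStream(), 64 * 1024));
        out.writeInt(42);                            // exact wire bytes
        out.flush();
        socket.close();
      }
    }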

A quick search turns up the following on the mail archives:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200508.mbox/[EMAIL PROTECTED]
http://mail-archives.apache.org/mod_mbox/lucene-hadoop-dev/200602.mbox/[EMAIL PROTECTED]

The latter ends with:

In summary, I don't think I'd reject a patch that makes this change, but I also would not personally wish to spend a lot of effort implementing it, since I don't see a huge value.

That remains my opinion. One could probably migrate Hadoop to use Serializable and/or Externalizable, possibly without a huge performance impact, and the system might even be easier to use. If someone achieved that, I'd say, "Bravo!" and commit the patch.

Doug
