On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:
Eric,
Since this is mostly technical, it probably should be on the
h-6685 jira instead of gene...@hadoop.
I think Hadoop should support interfaces to control
serialization plugin lifecycle and element serialization to / from an
abstract notion of a datum and bytes only.
The core of my h-6685 patch updates the API to replace the typename
with a serialization name and serialization specific metadata. That
metadata is a set of bytes that are defined by the serialization. The
typename alone is insufficient for Avro and having additional metadata
will be useful for the other serializations as well.
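As a rough illustration only (this is not the code in the patch, and every
name below is made up for the example), a file header that records a
serialization name plus serialization-defined metadata bytes, instead of a
bare typename, might look something like this in Java. The toy string
serialization just stores a charset name as its metadata; a serialization
like Avro would presumably need to carry something richer there.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical interface: illustrative names, NOT the API from the h-6685 patch.
interface SketchSerialization<T> {
    String getName();              // serialization name, stored in place of the typename
    byte[] getMetadata();          // opaque bytes whose meaning the serialization defines
    byte[] serialize(T datum);     // datum -> bytes
    T deserialize(byte[] bytes);   // bytes -> datum
}

// Toy serialization for strings; its metadata is simply the charset name.
final class Utf8StringSerialization implements SketchSerialization<String> {
    public String getName() { return "utf8-string"; }
    public byte[] getMetadata() { return "UTF-8".getBytes(StandardCharsets.US_ASCII); }
    public byte[] serialize(String datum) { return datum.getBytes(StandardCharsets.UTF_8); }
    public String deserialize(byte[] bytes) { return new String(bytes, StandardCharsets.UTF_8); }
}

// What a container file header could record (assumed layout, for illustration).
final class SketchHeader {
    final String serializationName;
    final byte[] metadata;

    SketchHeader(SketchSerialization<?> serialization) {
        this.serializationName = serialization.getName();
        // Defensive copy so later changes to the source array cannot alter the header.
        this.metadata = serialization.getMetadata().clone();
    }
}
```

The point of the sketch is that the framework only ever sees a name and a
byte array; interpreting the bytes is left entirely to the serialization.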
Doug suggested that I add a user-friendly pair of methods and I did.
While they are redundant, the set of serializations isn't expected to
be large and therefore the extra code isn't much.
I would like to not mention
a serialization implementation by name in Hadoop proper, at all.
My patch removes some of the lingering references to Writables in
SequenceFile, MapFile, etc. and moves them over to the generic
serialization API. The framework will likely continue to depend on
whichever serialization is used for RPC. Currently that is Writables,
but will likely transition to either Avro or ProtoBuf in the coming
year.
A single implementation to serve as a reference implementation makes
sense.
A critical part of Hadoop's usability comes from its framework
combined with library code that allows users to get the desired
functionality without writing it themselves. Sure, it is easy to write
a hash table yourself, but it is far easier to use the one bundled
with Java.
The default
classpath should remain as free of mandatory external dependencies as
possible; library conflicts are still an extremely sore spot in Java
development at many sites I visit and forcing a large commercial
entity to use version X of something like Avro, Thrift, PB is almost a
non-starter for many.
I discussed this problem in the jira, but the MapReduce user either is
using the X library or doesn't care about the version of X. If they are
using it, it is far more convenient to have the serialization on the
classpath. There is a missing feature that we need to address to put
the user's files ahead of the system ones. I'll file a jira for that.
It might also make sense for us to shade some of our dependencies, but
that is a much bigger issue and is far from clear cut.
If a PB / Thrift / Avro serialization implementation is part of
contrib or externally managed, it requires the user to understand this
dependency exists and manage the classpath.
The goal is to make Hadoop useful out of the box. If we make it so
that Hadoop is only useful once it is bundled with 15 other projects,
that is good for people who sell distributions that include Hadoop,
but not for the project.
I think we can simplify
serialization plugin configuration via a classpath include system by
using something like run-parts or similar and the current
configuration system, but that's another issue.
The current patch loads the serialization plugins based on the
configuration. If you don't want to support Thrift, don't configure
it. The same holds true of the other serializations, even Writable.
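To make the idea concrete, here is a minimal sketch of configuration-driven
plugin loading, in the same spirit as the existing io.serializations
setting. It is not the code in the patch: the property key, the interface,
and the plugin classes are all hypothetical, and java.util.Properties
stands in for Hadoop's Configuration so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical plugin interface; stands in for whatever the framework defines.
interface SketchPlugin {
    String getName();
}

// Two made-up plugins. Neither wraps a real serialization library.
final class WritableLikePlugin implements SketchPlugin {
    public String getName() { return "writable"; }
}

final class ThriftLikePlugin implements SketchPlugin {
    public String getName() { return "thrift"; }
}

final class SketchPluginLoader {
    // Assumed key for this sketch only; not a real Hadoop property.
    static final String KEY = "sketch.serializations";

    // Instantiates only the classes named in the comma-separated property value.
    static List<SketchPlugin> load(Properties conf) {
        List<SketchPlugin> plugins = new ArrayList<>();
        for (String className : conf.getProperty(KEY, "").split(",")) {
            String trimmed = className.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing configured, or a stray comma
            }
            try {
                Class<? extends SketchPlugin> cls =
                    Class.forName(trimmed).asSubclass(SketchPlugin.class);
                plugins.add(cls.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot load plugin " + trimmed, e);
            }
        }
        return plugins;
    }
}
```

A plugin that is not listed under the key is never instantiated, so leaving
it out of the configuration is all it takes to not support it.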
I'm a bit confused as to how this equates with sequence files being
deprecated or arrested.
Doug vetoed my patch partially based on his assertion that
SequenceFiles should be deprecated and that Hadoop should just be the
framework with no library code.
If we choose to focus development
elsewhere ("soft deprecate") or actively encourage users elsewhere
("@Deprecated") is an issue I think we can sever from this discussion.
At this point the PMC has supported continuing to invest in developing
SequenceFiles.
- Don't break existing SequenceFiles.
That goes without saying; everyone has petabytes of data in them.
- Common, HDFS, MR should contain as few mandatory external deps as
humanly possible because Java classloader semantics and a lack of
internal dep isolation is just kookoo for cocoa puffs. (Simplify it
and bring on our OSGI overlords.)
That is a much bigger discussion that we should probably have. There
are costs on both sides in terms of debugging and understandability.
In particular, in most cases we are much better off using a library
that has the right functionality than re-implementing it ourselves.
- We (non-committers / users / casual contributors) want only for
Hadoop to mature in features and stability, be an inviting community
to new potential contributors and users, and to be around for a long
time.
I want that too.
-- Owen