Re: [VOTE] Direction for Hadoop development

Owen O'Malley Tue, 07 Dec 2010 00:14:40 -0800


On Dec 6, 2010, at 7:36 PM, Eric Sammer wrote:


Eric,

Since this is mostly technical, it probably should be on the h-6685 jira instead of gene...@hadoop.

I think Hadoop should support interfaces to control
serialization plugin lifecycle and element serialization to / from an
abstract notion of a datum and bytes only.

The core of my h-6685 patch updates the API to replace the typename with a serialization name and serialization-specific metadata. That metadata is a set of bytes that are defined by the serialization. The typename alone is insufficient for Avro, and having additional metadata will be useful for the other serializations as well.
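
As a rough illustration of that shape (a name plus opaque, serialization-defined bytes), here is a minimal sketch. The class and method names are hypothetical and are not taken from the h-6685 patch; the only assumption about Avro is that its schema can be carried as JSON text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; these names are not from the h-6685 patch.
final class SerializationDescriptor {
    private final String name;      // which serialization, e.g. "writable" or "avro"
    private final byte[] metadata;  // opaque bytes, interpreted only by that serialization

    SerializationDescriptor(String name, byte[] metadata) {
        this.name = name;
        this.metadata = metadata.clone(); // defensive copy so callers cannot mutate it
    }

    String getName() {
        return name;
    }

    byte[] getMetadata() {
        return metadata.clone();
    }

    // Convenience for metadata that happens to be text, such as an Avro schema in JSON.
    static SerializationDescriptor ofText(String name, String text) {
        return new SerializationDescriptor(name, text.getBytes(StandardCharsets.UTF_8));
    }

    String metadataAsText() {
        return new String(metadata, StandardCharsets.UTF_8);
    }
}
```

A file format would store the name and the bytes together in its header and hand the bytes back, uninterpreted, to whichever serialization the name selects.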

Doug suggested that I add a user-friendly pair of methods and I did. While they are redundant, the set of serializations isn't expected to be large and therefore the extra code isn't much.

I would like to not mention
a serialization implementation by name in Hadoop proper, at all.

My patch removes some of the lingering references to Writables in SequenceFile, MapFile, etc. and moves them over to the generic serialization API. The framework will likely continue to depend on whichever serialization is used for RPC. Currently that is Writables, but it will likely transition to either Avro or ProtoBuf in the coming year.

A
single implementation to serve as a reference implementation makes
sense.

A critical part of Hadoop's usability comes from its framework combined with library code that allows users to get the desired functionality without writing it themselves. Sure, it is easy to write a hash table yourself, but it is far easier to use the one bundled with Java.

The default
classpath should remain as free of mandatory external dependencies as
possible; library conflicts are still an extremely sore spot in Java
development at many sites I visit and forcing a large commercial
entity to use version X of something like Avro, Thrift, PB is almost a
non-starter for many.

I discussed this problem in the jira, but either the MapReduce user is using the X library or doesn't care about the version of X. If they are using it, it is far more convenient to have the serialization on the classpath. There is a missing feature that we need to address to put the user's files ahead of the system ones. I'll file a jira for that.

It might also make sense for us to shade some of our dependencies, but that is a much bigger issue and is far from clear cut.

If a PB / Thrift / Avro serialization implementation is part of
contrib or externally managed, it requires the user to understand this
dependency exists and manage the classpath.

The goal is to make Hadoop useful out of the box. If we make it so that Hadoop is only useful once it is bundled with 15 other projects, that is good for people who sell distributions that include Hadoop, but not for the project.

I think we can simplify
serialization plugin configuration via a classpath include system by
using something like run-parts or similar and the current
configuration system, but that's another issue.

The current patch loads the serialization plugins based on the configuration. If you don't want to support thrift, don't configure it. The same holds true of the other serializations, even writable.
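
To make the idea concrete, here is a minimal sketch of configuration-driven loading. It is not the patch's code: `java.util.Properties` stands in for Hadoop's configuration, and the key `example.serializations`, the `SerializationPlugin` interface, and the `PluginLoader` class are all hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical plugin interface; a real one would also create serializers and deserializers.
interface SerializationPlugin {
    String name();
}

final class PluginLoader {
    // Hypothetical configuration key holding a comma-separated list of serialization names.
    static final String KEY = "example.serializations";

    // Instantiates only the serializations named in the configuration.
    static Map<String, SerializationPlugin> load(
            Properties conf, Map<String, Supplier<SerializationPlugin>> available) {
        Map<String, SerializationPlugin> active = new LinkedHashMap<>();
        for (String raw : conf.getProperty(KEY, "").split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue; // tolerate an empty list or stray commas
            }
            Supplier<SerializationPlugin> factory = available.get(name);
            if (factory == null) {
                throw new IllegalArgumentException("Unknown serialization: " + name);
            }
            active.put(name, factory.get());
        }
        return active;
    }
}
```

With `example.serializations=writable`, a thrift factory may be present in `available` but is never instantiated, which is the point above: anything left out of the configuration is simply not loaded.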

I'm a bit confused as to how this equates with sequence files being
deprecated or arrested.

Doug vetoed my patch partially based on his assertion that SequenceFiles should be deprecated and that Hadoop should just be the framework with no library code.

If we choose to focus development
elsewhere ("soft deprecate") or actively encourage users elsewhere
("@Deprecated") is an issue I think we can sever from this discussion.

At this point the PMC has supported continuing to invest in developing SequenceFiles.

- Don't break existing SequenceFiles.


That goes without saying; everyone has petabytes of data in them.

- Common, HDFS, MR should contain as few mandatory external deps as
humanly possible because Java classloader semantics and a lack of
internal dep isolation is just kookoo for cocoa puffs. (Simplify it
and bring on our OSGI overlords.)

That is a much bigger discussion that we should probably have. There are costs on both sides in terms of debugging and understandability. In particular, in most cases we are much better off using a library that has the right functionality than re-implementing it ourselves.

- We (non-committers / users / casual contributors) want only for
Hadoop to mature in features and stability, be an inviting community
to new potential contributors and users, and to be around for a long
time.


I want that too.

-- Owen
