On 29/11/10 22:30, Owen O'Malley wrote:
All,
Based on the discussion on HADOOP-6685, there is a pretty fundamental
difference of opinion about how Hadoop should evolve. We need to figure
out how the majority of the PMC wants the project to evolve to
understand which patches move us forward. Please vote whether you
approve of the following direction. Clearly as the author, I'm +1.
-- Owen
Hadoop has always included library code so that users had a strong
foundation to build their applications on without needing to continually
reinvent the wheel. This combination of framework and powerful library
code is a common pattern for successful projects, such as Java, Lucene,
etc. Toward that end, we need to continue to extend the Hadoop library
code and actively maintain it as the framework evolves. Continuing
support for SequenceFile and TFile, which are both widely used, is
mandatory. The opposite pattern of implementing the framework and
letting each distribution add the required libraries will lead to
increased community fragmentation and vendor lock-in.
Hadoop's generic serialization framework had a lot of promise when it
was introduced, but has been hampered by a lack of plugins other than
Writables and Java serialization. Supporting a wide range of
serializations natively in Hadoop will give the users new capabilities.
Currently, supporting Avro or ProtoBuf objects requires mutually
incompatible third-party solutions. It benefits Hadoop to handle all of
them through one common framework. In particular,
having easy, out of the box support for Thrift, ProtoBufs, Avro, and our
legacy serializations is a desired state.
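Hadoop's actual plugin point for this is the
org.apache.hadoop.io.serializer.Serialization interface, with plugins
listed in the io.serializations configuration key. The sketch below is a
self-contained, simplified illustration of that idea only; the class and
method names are invented for the example and are not Hadoop's real API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative plugin contract: each serialization says which classes it
// can handle and knows how to turn them into bytes and back.
interface Serialization<T> {
    boolean accept(Class<?> c);
    byte[] serialize(T obj);
    T deserialize(byte[] data, Class<T> c);
}

// A trivial plugin handling String via UTF-8. A real deployment would
// register Writable, Avro, Thrift and ProtoBuf plugins side by side.
class StringSerialization implements Serialization<String> {
    public boolean accept(Class<?> c) { return c == String.class; }
    public byte[] serialize(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    public String deserialize(byte[] data, Class<String> c) {
        return new String(data, StandardCharsets.UTF_8);
    }
}

// The factory walks the registered plugins and picks the first that
// accepts the requested type.
class SerializationFactory {
    private final List<Serialization<?>> plugins = new ArrayList<>();
    void register(Serialization<?> s) { plugins.add(s); }

    @SuppressWarnings("unchecked")
    <T> Serialization<T> getSerialization(Class<T> c) {
        for (Serialization<?> s : plugins) {
            if (s.accept(c)) return (Serialization<T>) s;
        }
        throw new IllegalArgumentException("No serialization for " + c);
    }
}

public class SerializationDemo {
    public static void main(String[] args) {
        SerializationFactory factory = new SerializationFactory();
        factory.register(new StringSerialization());
        Serialization<String> ser = factory.getSerialization(String.class);
        byte[] bytes = ser.serialize("hello");
        System.out.println(ser.deserialize(bytes, String.class)); // prints "hello"
    }
}
```

The point of the indirection is that framework code never hard-codes a
wire format: adding Thrift support becomes registering one more plugin.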
As a distributed system, there are many instances where Hadoop needs to
serialize data. Many of those applications need a lightweight, versioned
serialization framework like ProtocolBuffers or Thrift, and using them is
appropriate. Adding dependencies on Thrift and ProtocolBuffers to the
existing dependency on Avro is acceptable.
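The "versioned" part is worth making concrete: in both ProtocolBuffers
and Thrift, fields carry explicit numeric tags, so a schema can grow
without breaking old readers. A hypothetical proto2-style example (the
message and field names are invented for illustration):

```proto
// Field numbers, not names, go on the wire: old readers skip tags they
// don't recognise, and new readers fall back to defaults for absent fields.
message JobStatus {
  required string job_id = 1;
  optional int32 progress = 2;
  // Added in a later version; older binaries simply ignore it.
  optional string diagnostics = 3;
}
```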
I'm happy with new build-time dependencies on these libraries, with one
big warning. Until an official, non-incubation release of Thrift comes
out, Apache management will veto any redistribution of the Thrift JARs;
they aren't signed off for public use.
I'm not so sure about more runtime dependencies that go all the way into
the classpath of the things working with HDFS, or files created in it,
because that leads to version problems in private code. [Inevitably
Hadoop will end up adopting some OSGi-like classpath setup, but I'm
not pushing for that as it has its own interesting issues.]
At the same time, you can't add features without adding dependencies,
except by playing rebasing tricks, and I have mixed feelings about those
tricks:
good: lets the Hadoop team push things out on their schedule
bad: impossible to push out security bug fixes to dependent libraries
without rebuilding and re-releasing things. Your ops team will hate you.
Because of that second problem, and because it's extra work, I avoid
playing rebasing games and just try to get classpaths right in the first
place -which is easier said than done.
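For reference, the rebasing trick usually means shading: rebuilding a
dependency under a private package name so downstream classpaths can
carry their own version. A hypothetical maven-shade-plugin fragment
(relocating a bundled Guava is an invented example, not anything Hadoop
actually does here) might look like:

```xml
<!-- Sketch only: moves com.google.common classes under a private
     package inside the shaded JAR, rewriting bytecode references. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hadoop.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

This makes the version-conflict problem go away at the cost Steve
describes: a security fix in the relocated library now requires
rebuilding and re-releasing the shaded artifact.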
One issue raised in the HADOOP-6685 discussion was JSON as a format for
things. Adopting JSON -and deciding which JSON parser to use- is trouble.
Ignoring the ongoing discussion of serialization formats, the question
"should we use JSON?" really leads back to "which external JSON parser
do we want to use?", which is a separate -and significant- problem.
I say this as someone who has three separate JSON parsers on the runtime
classpath of something whose functional tests are failing in a Hudson
window blinking at me alongside this email application.
gson: http://code.google.com/p/google-gson/
http://mvnrepository.com/artifact/com.google.code.gson/gson/1.4
com.google.code.gson/gson-1.5.1; no runtime dependencies
-some people like the seamless binding to java objects, which I view as
repeating the same mistakes as WS-*.
json-lib: http://json-lib.sourceforge.net/
http://mvnrepository.com/artifact/net.sf.json-lib/json-lib/2.3
at runtime tends to need the usual commons-logging back end and
net.sf.json-lib/json-lib-2.3
net.sf.ezmorph/ezmorph-1.06
commons-lang-2.4
commons-collections-3.2.1
-low level, DOM-ish, could be improved to be more Java-5-intuitive
Jackson: http://jackson.codehaus.org
org.codehaus.jackson/jackson-core-asl-1.6.2
org.codehaus.jackson/jackson-asl/0.9.5
Now, before someone points out that three JSON parsers is too many, this
same code has Log4J, SLF4J (with a back end to JCL), a patched back-end
logger for Jetty to avoid SLF4J where possible, and a custom JCL
back-end. On the XML side there's Xerces and Xalan instead of the JVM
versions, and Hibernate pulling in dom4j alongside. Test runs add
HtmlUnit to the classpath, which pulls in the older httpclient libs,
along with the http-core stuff I've switched to.
Java library versions -while more manageable than native library
versions- are a pain. Regardless of the ugliness of XML or the
mediocrity of DOM, running over to JSON just because DOM is unwieldy is
trading one source of trouble for another.
If Hadoop is going to use JSON in places, then the discussion/decision
about which JSON parser to stick on the classpath is worthy of a JIRA
issue all of its own.
-steve
(returning to his failing tests)