On 29/11/10 22:30, Owen O'Malley wrote:
All,
Based on the discussion on HADOOP-6685, there is a pretty fundamental
difference of opinion about how Hadoop should evolve. We need to figure
out how the majority of the PMC wants the project to evolve to
understand which patches move us forward. Please vote whether you
approve of the following direction. Clearly as the author, I'm +1.
-- Owen
Hadoop has always included library code so that users had a strong
foundation to build their applications on without needing to continually
reinvent the wheel. This combination of framework and powerful library
code is a common pattern for successful projects, such as Java, Lucene,
etc. Toward that end, we need to continue to extend the Hadoop library
code and actively maintain it as the framework evolves. Continuing
support for SequenceFile and TFile, which are both widely used, is
mandatory. The opposite pattern of implementing the framework and
letting each distribution add the required libraries will lead to
increased community fragmentation and vendor lock-in.
Hadoop's generic serialization framework had a lot of promise when it
was introduced, but has been hampered by a lack of plugins other than
Writables and Java serialization. Supporting a wide range of
serializations natively in Hadoop will give the users new capabilities.
Currently, supporting Avro or ProtoBuf objects requires mutually
incompatible third-party solutions. It benefits Hadoop to handle all of
them through one common framework. In particular,
having easy, out of the box support for Thrift, ProtoBufs, Avro, and our
legacy serializations is a desired state.
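Hadoop's actual plugin point for this is the
org.apache.hadoop.io.serializer.Serialization interface, with plugins
listed in the io.serializations configuration key. The sketch below is a
self-contained, simplified illustration of that idea only; the class and
method names are invented for the example and are not Hadoop's real API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative plugin contract: each serialization says which classes it
// can handle and knows how to turn them into bytes and back.
interface Serialization<T> {
    boolean accept(Class<?> c);
    byte[] serialize(T obj);
    T deserialize(byte[] data, Class<T> c);
}

// A trivial plugin handling String via UTF-8. A real deployment would
// register Writable, Avro, Thrift and ProtoBuf plugins side by side.
class StringSerialization implements Serialization<String> {
    public boolean accept(Class<?> c) { return c == String.class; }
    public byte[] serialize(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    public String deserialize(byte[] data, Class<String> c) {
        return new String(data, StandardCharsets.UTF_8);
    }
}

// The factory walks the registered plugins and picks the first that
// accepts the requested type.
class SerializationFactory {
    private final List<Serialization<?>> plugins = new ArrayList<>();
    void register(Serialization<?> s) { plugins.add(s); }

    @SuppressWarnings("unchecked")
    <T> Serialization<T> getSerialization(Class<T> c) {
        for (Serialization<?> s : plugins) {
            if (s.accept(c)) return (Serialization<T>) s;
        }
        throw new IllegalArgumentException("No serialization for " + c);
    }
}

public class SerializationDemo {
    public static void main(String[] args) {
        SerializationFactory factory = new SerializationFactory();
        factory.register(new StringSerialization());
        Serialization<String> ser = factory.getSerialization(String.class);
        byte[] bytes = ser.serialize("hello");
        System.out.println(ser.deserialize(bytes, String.class)); // prints "hello"
    }
}
```

The point of the indirection is that framework code never hard-codes a
wire format: adding Thrift support becomes registering one more plugin.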
As a distributed system, there are many instances where Hadoop needs to
serialize data. Many of those applications need a lightweight, versioned
serialization framework like ProtocolBuffers or Thrift, and using them is
appropriate. Adding dependencies on Thrift and ProtocolBuffers to the
existing dependency on Avro is acceptable.
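The "versioned" part is worth making concrete: in both ProtocolBuffers
and Thrift, fields carry explicit numeric tags, so a schema can grow
without breaking old readers. A hypothetical proto2-style example (the
message and field names are invented for illustration):

```proto
// Field numbers, not names, go on the wire: old readers skip tags they
// don't recognise, and new readers fall back to defaults for absent fields.
message JobStatus {
  required string job_id = 1;
  optional int32 progress = 2;
  // Added in a later version; older binaries simply ignore it.
  optional string diagnostics = 3;
}
```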
I'm happy with new build-time dependencies on these libraries, with one
big warning. Until an official, non-incubation release of Thrift comes
out, Apache management will veto any redistribution of the Thrift JARs;
they aren't signed off for public use.
I'm not so sure about more runtime dependencies that go all the way into
the classpath of the things working with HDFS, or files created in it,
because that leads to version problems in private code. [Inevitably
Hadoop will end up adopting some OSGi-like classpath setup, but I'm
not pushing for that as it has its own interesting issues.]
At the same time, you can't add features without adding dependencies,
except by playing rebasing tricks, and I have mixed feelings about those
tricks:
good: lets the Hadoop team push things out on their schedule
bad: impossible to push out security bug fixes to dependent libraries
without rebuilding and re-releasing things. Your ops team will hate you.
Because of that second problem, and because it's extra work, I avoid
playing rebasing games and just try to get classpaths right in the first
place -which is easier said than done.
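For reference, the rebasing trick usually means shading: rebuilding a
dependency under a private package name so downstream classpaths can
carry their own version. A hypothetical maven-shade-plugin fragment
(relocating a bundled Guava is an invented example, not anything Hadoop
actually does here) might look like:

```xml
<!-- Sketch only: moves com.google.common classes under a private
     package inside the shaded JAR, rewriting bytecode references. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.common</pattern>
            <shadedPattern>org.apache.hadoop.shaded.com.google.common</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

This makes the version-conflict problem go away at the cost Steve
describes: a security fix in the relocated library now requires
rebuilding and re-releasing the shaded artifact.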
One issue raised in the HADOOP-6685 discussion was JSON as a format for
things. Adopting JSON -and deciding which JSON parser to use- is trouble.
Ignoring the ongoing discussion of serialization formats, the question
"should we use JSON?" really leads back to "which external JSON parser
do we want to use?", which is a separate -and significant- problem.
I say this as someone who has three separate JSON parsers on the runtime
classpath of something whose functional tests are failing in a Hudson
window blinking at me alongside this email application.
gson: http://code.google.com/p/google-gson/
http://mvnrepository.com/artifact/com.google.code.gson/gson/1.4
com.google.code.gson/gson-1.5.1; no runtime dependencies
-some people like the seamless binding to java objects, which I view as
repeating the same mistakes as WS-*.
json-lib: http://json-lib.sourceforge.net/
http://mvnrepository.com/artifact/net.sf.json-lib/json-lib/2.3
at runtime tends to need the usual commons-logging back end and
net.sf.json-lib/json-lib-2.3
net.sf.ezmorph/ezmorph-1.06
commons-lang-2.4
commons-collections-3.2.1
-low level, DOM-ish, could be improved to be more Java-5-intuitive
Jackson: http://jackson.codehaus.org
org.codehaus.jackson/jackson-core-asl-1.6.2
org.codehaus.jackson/jackson-asl/0.9.5
Now, before someone points out that three JSON parsers is too many, this
same code has Log4J, SLF4J (with a back end to JCL), a patched back-end
logger for Jetty to avoid SLF4J where possible, and a custom JCL
back-end. On the XML side there's Xerces and Xalan instead of the JVM
versions, and Hibernate pulling in dom4j alongside. Test runs add
HtmlUnit to the classpath, which pulls in the older httpclient libs,
along with the http-core stuff I've switched to.
Java library versions -while more manageable than native library
versions- are a pain. Regardless of the ugliness of XML or the
mediocrity of DOM, running over to JSON just because DOM is unwieldy is
trading one source of trouble for another.
If Hadoop is going to use JSON in places, then the discussion/decision
about which JSON parser to stick on the classpath is worthy of a JIRA
issue all of its own.
-steve
(returning to his failing tests)