On 12/05/2011 03:26, M. C. Srivas wrote:
While the HCK is a great idea to check quickly if an implementation is
"compliant", we still need a written specification to define what is meant
by compliance, something akin to a set of RFCs, or a set of docs like the
IEEE POSIX specifications.
For example, the POSIX.1c pthreads API has a written document that specifies
all the function calls, input params, return values, and error codes. It
clearly indicates what any POSIX-compliant threads package needs to support,
and which vendor-specific, non-portable extensions one can use at one's own
risk.
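To make that concrete for Hadoop: here is a hypothetical sketch of what one
such spec entry could look like, written as a Java interface. The
SpecifiedFileSystem name is invented, and the failure modes listed are
illustrative of what a spec would pin down, not what the current code
promises:

import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;

public interface SpecifiedFileSystem {
  /**
   * Open a file for reading.
   *
   * @param f          path to an existing, readable file
   * @param bufferSize read buffer size in bytes; must be > 0
   * @return a seekable stream positioned at byte 0 of the file
   * @throws FileNotFoundException if f does not exist
   * @throws IOException if f is a directory, or on any transport failure
   */
  FSDataInputStream open(Path f, int bufferSize)
      throws FileNotFoundException, IOException;
}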
I have been known to be critical of standards bodies in the past
http://www.waterfall2006.com/loughran.html
And I've been in them. It is absolutely essential that the Hadoop stack
doesn't become controlled by a standards body, as then you become
controlled by whoever can afford to send the most people to the
standards events -and make behind-the-scenes deals with others to get
votes through.
Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
specification is extracted only by looking at the code, or (where the code
is non-trivial) by writing really bizarre test programs to examine corner
cases. Further, the interaction between a mix of the old and new APIs is not
specified anywhere. Such specifications are vitally important when
implementing libraries like Cascading, Mahout, etc. For example, an
application might open a file using the new API, and pass that stream into a
library that manipulates the stream using some of the old API ... what is
then the expectation of the state of the stream when the library call
returns?
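To illustrate, a minimal sketch -the path and the legacyLibraryScan helper
are made up, standing in for something like a Cascading call. The app opens
the stream through the newer FileContext entry point and hands it to a
library written against the older conventions; nothing anywhere says what
position the caller should see afterwards:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class MixedApiStream {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext(new Configuration());
    FSDataInputStream in = fc.open(new Path("/data/input"));
    in.seek(1024);            // the caller positions the stream
    legacyLibraryScan(in);    // library reads via the old conventions
    long pos = in.getPos();   // unspecified: 1024? 1024 + bytes read?
    in.close();
  }

  // Hypothetical stand-in for a library call; it advances the stream
  // as a side effect, and nothing says it must restore the position.
  static void legacyLibraryScan(FSDataInputStream in) throws Exception {
    byte[] buf = new byte[4096];
    in.read(buf);
  }
}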
Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail such
things down. There's similar good effort in the Map/Reduce and Avro spaces,
but it seems to have stalled somewhat. We should continue it.
Doing such specs would be a great service to the community and the users of
Hadoop. It provides them
(a) clear-cut docs on how to use the Hadoop APIs
+1
(b) wider choice of Hadoop implementations by freeing them from vendor
lock-in.
=0
They won't be Hadoop implementations, they will be "something that is
compatible with the Apache Hadoop API as defined in v 0.x of the Hadoop
compatibility kit". Furthermore, there's the issue of any Google patents
-while Google has given Hadoop permission to use them, that may not apply
to other things that implement compatible APIs.
I also think that the Hadoop team need to be the ones who own the
interfaces and tests, define the tests as a functional test suite for
testing Hadoop distributions, and reserve the right to make changes to
the interfaces, semantics and tests as suits the team's needs. The input
from others -especially related community projects- is important but,
to be ruthless, the compatibility issues with things that aren't really
Apache Hadoop are less important. If you choose to reimplement Hadoop,
you take on the costs of staying current.
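As a sketch of what one test in that suite might look like (the expected
exception is an assumption on my part -whether seeking past EOF raises
EOFException or a plain IOException is exactly the kind of corner case the
spec has to settle):

import static org.junit.Assert.fail;

import java.io.EOFException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class SeekContractTest {
  @Test
  public void seekPastEofMustThrow() throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/seek-contract");
    fs.create(p, true).close();     // create a zero-byte file
    FSDataInputStream in = fs.open(p);
    try {
      in.seek(1);                   // seek past EOF on an empty file
      fail("seek past EOF should have thrown");
    } catch (EOFException expected) {
      // the behaviour a compatible implementation must reproduce
    } finally {
      in.close();
    }
  }
}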
Once we have such a specification, the HCK becomes meaningful (since the HCK
itself will be buggy initially, the spec is what decides whether a failure
lies in the implementation or in the kit).