On 12/05/2011 03:26, M. C. Srivas wrote:
While the HCK is a great idea to check quickly if an implementation is
"compliant", we still need a written specification to define what is meant
by compliance, something akin to a set of RFCs, or a set of docs like the
IEEE POSIX specifications.
For example, the POSIX.1c pthreads API has a written document that specifies
all the function calls, input params, return values, and error codes. It
clearly indicates what any POSIX-compliant threads package needs to support,
and which vendor-specific, non-portable extensions one can use at one's own
risk.
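To make that concrete for Hadoop: here is a hypothetical sketch of what one
such spec entry could look like, written as a Java interface. The
SpecifiedFileSystem name is invented, and the failure modes listed are
illustrative of what a spec would pin down, not what the current code
promises:

import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;

public interface SpecifiedFileSystem {
  /**
   * Open a file for reading.
   *
   * @param f          path to an existing, readable file
   * @param bufferSize read buffer size in bytes; must be > 0
   * @return a seekable stream positioned at byte 0 of the file
   * @throws FileNotFoundException if f does not exist
   * @throws IOException if f is a directory, or on any transport failure
   */
  FSDataInputStream open(Path f, int bufferSize)
      throws FileNotFoundException, IOException;
}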
I have been known to be critical of standards bodies in the past
http://www.waterfall2006.com/loughran.html
And I've been in them. It is absolutely essential that the Hadoop stack
doesn't become controlled by a standards body, as then you become
controlled by whoever can afford to send the most people to the
standards events -and make behind-the-scenes deals with others to get
votes through.
Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the
specification is extracted only by looking at the code, or (where the code
is non-trivial) by writing really bizarre test programs to examine corner
cases. Further, the interaction between a mix of the old and new APIs is not
specified anywhere. Such specifications are vitally important when
implementing libraries like Cascading, Mahout, etc. For example, an
application might open a file using the new API, and pass that stream into a
library that manipulates the stream using some of the old API ... what is
then the expectation of the state of the stream when the library call
returns?
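To illustrate, a minimal sketch -the path and the legacyLibraryScan helper
are made up, standing in for something like a Cascading call. The app opens
the stream through the newer FileContext entry point and hands it to a
library written against the older conventions; nothing anywhere says what
position the caller should see afterwards:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class MixedApiStream {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext(new Configuration());
    FSDataInputStream in = fc.open(new Path("/data/input"));
    in.seek(1024);            // the caller positions the stream
    legacyLibraryScan(in);    // library reads via the old conventions
    long pos = in.getPos();   // unspecified: 1024? 1024 + bytes read?
    in.close();
  }

  // Hypothetical stand-in for a library call; it advances the stream
  // as a side effect, and nothing says it must restore the position.
  static void legacyLibraryScan(FSDataInputStream in) throws Exception {
    byte[] buf = new byte[4096];
    in.read(buf);
  }
}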
Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail such
things down. There's similar good effort in the Map/Reduce and Avro spaces,
but it seems to have stalled somewhat. We should continue it.
Doing such specs would be a great service to the community and the users of
Hadoop. It provides them
(a) clear-cut docs on how to use the Hadoop APIs
+1
(b) wider choice of Hadoop implementations by freeing them from vendor
lock-in.
=0
They won't be Hadoop implementations, they will be "something that is
compatible with the Apache Hadoop API as defined in v 0.x of the Hadoop
compatibility kit". Furthermore, there's the issue of any Google patents
-while Google has given Hadoop permission to use them, that may not apply
to other things that implement compatible APIs.
I also think that the Hadoop team need to be the ones who own the
interfaces and tests, define the tests as a functional test suite for
testing Hadoop distributions, and reserve the right to make changes to
the interfaces, semantics and tests as suits the team's needs. The input
from others -especially related community projects- is important but,
to be ruthless, the compatibility issues with things that aren't really
Apache Hadoop are less important. If you choose to reimplement Hadoop,
you take on the costs of staying current.
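As a sketch of what one test in that suite might look like (the expected
exception is an assumption on my part -whether seeking past EOF raises
EOFException or a plain IOException is exactly the kind of corner case the
spec has to settle):

import static org.junit.Assert.fail;

import java.io.EOFException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class SeekContractTest {
  @Test
  public void seekPastEofMustThrow() throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/seek-contract");
    fs.create(p, true).close();     // create a zero-byte file
    FSDataInputStream in = fs.open(p);
    try {
      in.seek(1);                   // seek past EOF on an empty file
      fail("seek past EOF should have thrown");
    } catch (EOFException expected) {
      // the behaviour a compatible implementation must reproduce
    } finally {
      in.close();
    }
  }
}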
Once we have such a specification, the HCK becomes meaningful (since the HCK
itself will be buggy initially, the spec is what decides whether a failure
lies in the implementation or in the kit).