While the HCK is a great idea for quickly checking whether an implementation is "compliant", we still need a written specification that defines what compliance means, something akin to a set of RFCs, or a set of documents like the IEEE POSIX specifications.
For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input parameters, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and which extensions are vendor-specific and non-portable, to be used at one's own risk. Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the specification can be extracted only by reading the code, or (where the code is non-trivial) by writing really bizarre test programs to examine corner cases. Further, the interaction between a mix of the old and new APIs is not specified anywhere.

Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API and pass that stream into a library that manipulates it using some of the old API; what is then the expected state of the stream when the library call returns? (A sketch of exactly that situation is below.) Sanjay Radia @ Y! has already started specifying some of the DFS APIs to nail such things down. There are similar good efforts in the Map/Reduce and Avro spaces, but they seem to have stalled somewhat; we should continue them.

Writing such specs would be a great service to the community and the users of Hadoop. It gives them (a) clear-cut documentation on how to use the Hadoop APIs and (b) a wider choice of Hadoop implementations, freeing them from vendor lock-in. Once we have such a specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).
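To illustrate the stream-state question, here is a rough sketch. The LegacyLib helper is hypothetical, standing in for any third-party library written against the older FileSystem-era idioms; nothing below is existing Hadoop code beyond the public API calls:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileContext;
  import org.apache.hadoop.fs.Path;

  public class StreamStateQuestion {

    // Stand-in for a third-party library built on the older API's idioms
    // (hypothetical, for illustration only).
    static class LegacyLib {
      static void readHeader(FSDataInputStream in) throws IOException {
        in.seek(0);
        byte[] magic = new byte[4];
        in.readFully(magic);      // consumes part of the stream
      }
    }

    public static void main(String[] args) throws Exception {
      FileContext fc = FileContext.getFileContext(new Configuration());

      // Open with the newer FileContext API...
      FSDataInputStream in = fc.open(new Path("/data/input.seq"));

      // ...then hand the stream to code written against the older API.
      LegacyLib.readHeader(in);

      // What is now guaranteed about the stream? Is getPos() defined to be
      // just past the header, or is the position unspecified? A written
      // spec should say; today the answer is whatever the code happens to do.
      System.out.println("position after library call: " + in.getPos());

      in.close();
    }
  }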
On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <[email protected]> wrote:
> I think it's time to separate out functional tests as a "Hadoop
> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
> 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
>
> - milind
> --
> Milind Bhandarkar
> [email protected]
>
>
> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <[email protected]> wrote:
>
> >This is a really interesting topic! I completely agree that we need to
> >get ahead of this.
> >
> >I would be really interested in learning of any experience other apache
> >projects, such as apache or tomcat have with these issues.
> >
> >---
> >E14 - typing on glass
> >
> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <[email protected]> wrote:
> >
> >>
> >> Back in Jan 2011, I started a discussion about how to define Apache
> >> Hadoop Compatibility:
> >>
> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D[email protected]%3E
> >>
> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
> >>
> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
> >>
> >> It claims that their implementations are 100% compatible, even though
> >> the Enterprise edition uses a C filesystem. It also claims that both
> >> their software releases contain "Certified Stacks", without defining
> >> what Certified means, or who does the certification -only that it is an
> >> improvement.
> >>
> >> I think we should revisit this issue before people with their own
> >> agendas define what compatibility with Apache Hadoop is for us
> >>
> >> Licensing
> >> -Use of the Hadoop codebase must follow the Apache License
> >>  http://www.apache.org/licenses/LICENSE-2.0
> >> -plug in components that are dynamically linked to (Filesystems and
> >>  schedulers) don't appear to be derivative works on my reading of this,
> >>
> >> Naming
> >> -this is something for branding@apache, they will have their opinions.
> >>  The key one is that the name "Apache Hadoop" must get used, and it's
> >>  important to make clear it is a derivative work.
> >> -I don't think you can claim to have a Distribution/Fork/Version of
> >>  Apache Hadoop if you swap out big chunks of it for alternate
> >>  filesystems, MR engines, etc. Some description of this is needed
> >>  "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
> >>
> >> Compatibility
> >> -the definition of the Hadoop interfaces and classes is the Apache
> >>  Source tree,
> >> -the definition of semantics of the Hadoop interfaces and classes is
> >>  the Apache Source tree, including the test classes.
> >> -the verification that the actual semantics of an Apache Hadoop
> >>  release is compatible with the expected semantics is that current and
> >>  future tests pass
> >> -bug reports can highlight incompatibility with expectations of
> >>  community users, and once incorporated into tests form part of the
> >>  compatibility testing
> >> -vendors can claim and even certify their derivative works as
> >>  compatible with other versions of their derivative works, but cannot
> >>  claim compatibility with Apache Hadoop unless their code passes the
> >>  tests and is consistent with the bug reports marked as ("by design").
> >>  Perhaps we should have tests that verify each of these "by design"
> >>  bugreps to make them more formal.
> >>
> >> Certification
> >> -I have no idea what this means in EMC's case, they just say
> >>  "Certified"
> >> -As we don't do any certification ourselves, it would seem impossible
> >>  for us to certify that any derivative work is compatible.
> >> -It may be best to state that nobody can certify their derivative as
> >>  "compatible with Apache Hadoop" unless it passes all current test suites
> >> -And require that anyone who declares compatibility define what they
> >>  mean by this
> >>
> >> This is a good argument for getting more functional tests out there
> >> -whoever has more functional tests needs to get them into a test module
> >>  that can be used to test real deployments.
> >>
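
To make the functional-tests point concrete: a compatibility test in an HCK might look roughly like the sketch below. This is illustrative only (the class name and the "test.fs.uri" property are made up, not existing Hadoop code); the idea is that the filesystem under test comes from configuration, so the same test can run against HDFS, a vendor filesystem, or a live deployment.

  import static org.junit.Assert.*;

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.junit.Test;

  public class RenameContractTest {

    @Test
    public void renamedFileIsNoLongerAtSource() throws Exception {
      // The filesystem under test is chosen by configuration, not
      // hard-coded, so the suite can be pointed at a real cluster.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(
          URI.create(conf.get("test.fs.uri", "file:///")), conf);

      Path src = new Path("/tmp/hck-rename-src");
      Path dst = new Path("/tmp/hck-rename-dst");
      fs.delete(src, false);
      fs.delete(dst, false);

      FSDataOutputStream out = fs.create(src, true);
      out.writeUTF("payload");
      out.close();

      assertTrue("rename() must return true", fs.rename(src, dst));
      assertFalse("source must not exist after rename", fs.exists(src));
      assertTrue("destination must exist after rename", fs.exists(dst));
    }
  }

Bug reports marked "by design" could be turned into exactly this kind of test, which would make the "consistent with by-design bug reports" criterion checkable rather than a matter of interpretation.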
