While the HCK is a great idea for quickly checking whether an implementation is "compliant", we still need a written specification that defines what compliance means, something akin to a set of RFCs, or a set of documents like the IEEE POSIX specifications.
For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input parameters, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and which extensions are vendor-specific and non-portable, to be used at one's own risk. Currently we have two sets of APIs in the DFS and Map/Reduce layers, and the specification can be extracted only by reading the code, or (where the code is non-trivial) by writing really bizarre test programs to examine corner cases. Further, the interaction between a mix of the old and new APIs is not specified anywhere.

Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API and pass that stream into a library that manipulates it using some of the old API; what is then the expected state of the stream when the library call returns? (A sketch of exactly that situation is below.) Sanjay Radia @ Y! has already started specifying some of the DFS APIs to nail such things down. There are similar good efforts in the Map/Reduce and Avro spaces, but they seem to have stalled somewhat; we should continue them.

Writing such specs would be a great service to the community and the users of Hadoop. It gives them (a) clear-cut documentation on how to use the Hadoop APIs and (b) a wider choice of Hadoop implementations, freeing them from vendor lock-in. Once we have such a specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).
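To illustrate the stream-state question, here is a rough sketch. The LegacyLib helper is hypothetical, standing in for any third-party library written against the older FileSystem-era idioms; nothing below is existing Hadoop code beyond the public API calls:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileContext;
  import org.apache.hadoop.fs.Path;

  public class StreamStateQuestion {

    // Stand-in for a third-party library built on the older API's idioms
    // (hypothetical, for illustration only).
    static class LegacyLib {
      static void readHeader(FSDataInputStream in) throws IOException {
        in.seek(0);
        byte[] magic = new byte[4];
        in.readFully(magic);      // consumes part of the stream
      }
    }

    public static void main(String[] args) throws Exception {
      FileContext fc = FileContext.getFileContext(new Configuration());

      // Open with the newer FileContext API...
      FSDataInputStream in = fc.open(new Path("/data/input.seq"));

      // ...then hand the stream to code written against the older API.
      LegacyLib.readHeader(in);

      // What is now guaranteed about the stream? Is getPos() defined to be
      // just past the header, or is the position unspecified? A written
      // spec should say; today the answer is whatever the code happens to do.
      System.out.println("position after library call: " + in.getPos());

      in.close();
    }
  }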
On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <[email protected]> wrote:
> I think it's time to separate out functional tests as a "Hadoop
> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
> 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
>
> - milind
> --
> Milind Bhandarkar
> [email protected]
>
>
> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <[email protected]> wrote:
>
> >This is a really interesting topic! I completely agree that we need to
> >get ahead of this.
> >
> >I would be really interested in learning of any experience other apache
> >projects, such as apache or tomcat have with these issues.
> >
> >---
> >E14 - typing on glass
> >
> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <[email protected]> wrote:
> >
> >>
> >> Back in Jan 2011, I started a discussion about how to define Apache
> >> Hadoop Compatibility:
> >>
> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D[email protected]%3E
> >>
> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
> >>
> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
> >>
> >> It claims that their implementations are 100% compatible, even though
> >> the Enterprise edition uses a C filesystem. It also claims that both
> >> their software releases contain "Certified Stacks", without defining
> >> what Certified means, or who does the certification -only that it is an
> >> improvement.
> >>
> >> I think we should revisit this issue before people with their own
> >> agendas define what compatibility with Apache Hadoop is for us
> >>
> >> Licensing
> >> -Use of the Hadoop codebase must follow the Apache License
> >>  http://www.apache.org/licenses/LICENSE-2.0
> >> -plug in components that are dynamically linked to (Filesystems and
> >>  schedulers) don't appear to be derivative works on my reading of this,
> >>
> >> Naming
> >> -this is something for branding@apache, they will have their opinions.
> >>  The key one is that the name "Apache Hadoop" must get used, and it's
> >>  important to make clear it is a derivative work.
> >> -I don't think you can claim to have a Distribution/Fork/Version of
> >>  Apache Hadoop if you swap out big chunks of it for alternate
> >>  filesystems, MR engines, etc. Some description of this is needed
> >>  "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
> >>
> >> Compatibility
> >> -the definition of the Hadoop interfaces and classes is the Apache
> >>  Source tree,
> >> -the definition of semantics of the Hadoop interfaces and classes is
> >>  the Apache Source tree, including the test classes.
> >> -the verification that the actual semantics of an Apache Hadoop
> >>  release is compatible with the expected semantics is that current and
> >>  future tests pass
> >> -bug reports can highlight incompatibility with expectations of
> >>  community users, and once incorporated into tests form part of the
> >>  compatibility testing
> >> -vendors can claim and even certify their derivative works as
> >>  compatible with other versions of their derivative works, but cannot
> >>  claim compatibility with Apache Hadoop unless their code passes the
> >>  tests and is consistent with the bug reports marked as ("by design").
> >>  Perhaps we should have tests that verify each of these "by design"
> >>  bugreps to make them more formal.
> >>
> >> Certification
> >> -I have no idea what this means in EMC's case, they just say
> >>  "Certified"
> >> -As we don't do any certification ourselves, it would seem impossible
> >>  for us to certify that any derivative work is compatible.
> >> -It may be best to state that nobody can certify their derivative as
> >>  "compatible with Apache Hadoop" unless it passes all current test suites
> >> -And require that anyone who declares compatibility define what they
> >>  mean by this
> >>
> >> This is a good argument for getting more functional tests out there
> >> -whoever has more functional tests needs to get them into a test module
> >>  that can be used to test real deployments.
> >>
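
To make the functional-tests point concrete: a compatibility test in an HCK might look roughly like the sketch below. This is illustrative only (the class name and the "test.fs.uri" property are made up, not existing Hadoop code); the idea is that the filesystem under test comes from configuration, so the same test can run against HDFS, a vendor filesystem, or a live deployment.

  import static org.junit.Assert.*;

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.junit.Test;

  public class RenameContractTest {

    @Test
    public void renamedFileIsNoLongerAtSource() throws Exception {
      // The filesystem under test is chosen by configuration, not
      // hard-coded, so the suite can be pointed at a real cluster.
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(
          URI.create(conf.get("test.fs.uri", "file:///")), conf);

      Path src = new Path("/tmp/hck-rename-src");
      Path dst = new Path("/tmp/hck-rename-dst");
      fs.delete(src, false);
      fs.delete(dst, false);

      FSDataOutputStream out = fs.create(src, true);
      out.writeUTF("payload");
      out.close();

      assertTrue("rename() must return true", fs.rename(src, dst));
      assertFalse("source must not exist after rename", fs.exists(src));
      assertTrue("destination must exist after rename", fs.exists(dst));
    }
  }

Bug reports marked "by design" could be turned into exactly this kind of test, which would make the "consistent with by-design bug reports" criterion checkable rather than a matter of interpretation.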
