HCK and written specifications are not mutually exclusive. However, given the evolving nature of Hadoop APIs, functional tests need to evolve as well, and keeping them tied to a "current stable" version is easier than doing the same for written specifications.
- milind

--
Milind Bhandarkar
[email protected]
+1-650-776-3167

On 5/11/11 7:26 PM, "M. C. Srivas" <[email protected]> wrote:

> While the HCK is a great idea to check quickly whether an implementation is "compliant", we still need a written specification to define what is meant by compliance, something akin to a set of RFCs, or a set of docs like the IEEE POSIX specifications.
>
> For example, the POSIX.1c pthreads API has a written document that specifies all the function calls, input params, return values, and error codes. It clearly indicates what any POSIX-compliant threads package needs to support, and which vendor-specific non-portable extensions one can use at one's own risk.
>
> Currently we have 2 sets of APIs in the DFS and Map/Reduce layers, and the specification is extracted only by looking at the code, or (where the code is non-trivial) by writing really bizarre test programs to examine corner cases. Further, the interaction between a mix of the old and new APIs is not specified anywhere. Such specifications are vitally important when implementing libraries like Cascading, Mahout, etc. For example, an application might open a file using the new API, and pass that stream into a library that manipulates the stream using some of the old API ... what is then the expectation of the state of the stream when the library call returns? (A sketch of this mixed-API situation follows the thread.)
>
> Sanjay Radia @ Y! has already started specifying some of the DFS APIs to nail such things down. There's a similar good effort in the Map/Reduce and Avro spaces, but it seems to have stalled somewhat. We should continue it.
>
> Doing such specs would be a great service to the community and the users of Hadoop. It provides them
>  (a) clear-cut docs on how to use the Hadoop APIs
>  (b) a wider choice of Hadoop implementations, by freeing them from vendor lock-in.
>
> Once we have such a specification, the HCK becomes meaningful (since the HCK itself will be buggy initially).
>
>
> On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar <[email protected]> wrote:
>
>> I think it's time to separate out functional tests as a "Hadoop Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL 2.0. Then "certification" would mean "Passes 100% of the HCK test suite."
>>
>> - milind
>> --
>> Milind Bhandarkar
>> [email protected]
>>
>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <[email protected]> wrote:
>>
>>> This is a really interesting topic! I completely agree that we need to get ahead of this.
>>>
>>> I would be really interested in learning of any experience other Apache projects, such as httpd or Tomcat, have with these issues.
>>>
>>> ---
>>> E14 - typing on glass
>>>
>>> On May 10, 2011, at 6:31 AM, "Steve Loughran" <[email protected]> wrote:
>>>
>>>> Back in Jan 2011, I started a discussion about how to define Apache Hadoop compatibility:
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[email protected]%3E
>>>>
>>>> I am now reading the EMC HD "Enterprise Ready" Apache Hadoop datasheet:
>>>>
>>>> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>>>
>>>> It claims that their implementations are 100% compatible, even though the Enterprise edition uses a C filesystem. It also claims that both their software releases contain "Certified Stacks", without defining what Certified means, or who does the certification - only that it is an improvement.
>>>> I think we should revisit this issue before people with their own agendas define what compatibility with Apache Hadoop is for us.
>>>>
>>>> Licensing
>>>> - Use of the Hadoop codebase must follow the Apache License: http://www.apache.org/licenses/LICENSE-2.0
>>>> - Plug-in components that are dynamically linked to (filesystems and schedulers) don't appear to be derivative works, on my reading of this.
>>>>
>>>> Naming
>>>> - This is something for branding@apache; they will have their opinions. The key one is that the name "Apache Hadoop" must get used, and it's important to make clear that it is a derivative work.
>>>> - I don't think you can claim to have a Distribution/Fork/Version of Apache Hadoop if you swap out big chunks of it for alternate filesystems, MR engines, etc. Some description of this is needed: "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ".
>>>>
>>>> Compatibility
>>>> - The definition of the Hadoop interfaces and classes is the Apache source tree.
>>>> - The definition of the semantics of the Hadoop interfaces and classes is the Apache source tree, including the test classes.
>>>> - The verification that the actual semantics of an Apache Hadoop release are compatible with the expected semantics is that current and future tests pass.
>>>> - Bug reports can highlight incompatibility with the expectations of community users, and once incorporated into tests they form part of the compatibility testing.
>>>> - Vendors can claim and even certify their derivative works as compatible with other versions of their derivative works, but cannot claim compatibility with Apache Hadoop unless their code passes the tests and is consistent with the bug reports marked "by design". Perhaps we should have tests that verify each of these "by design" bug reports, to make them more formal.
>>>>
>>>> Certification
>>>> - I have no idea what this means in EMC's case; they just say "Certified".
>>>> - As we don't do any certification ourselves, it would seem impossible for us to certify that any derivative work is compatible.
>>>> - It may be best to state that nobody can certify their derivative as "compatible with Apache Hadoop" unless it passes all current test suites.
>>>> - And require that anyone who declares compatibility define what they mean by this.
>>>>
>>>> This is a good argument for getting more functional tests out there - whoever has such tests needs to get them into a test module that can be used to test real deployments. (A sketch of such a contract test follows below.)
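To make the mixed-API question above concrete, here is a minimal Java sketch. It assumes a Hadoop 0.21+ classpath, where both the older FileSystem-era stream interfaces (Seekable) and the newer FileContext API are available; the file path and the legacyLibraryScan helper are hypothetical. The point is only that nothing in the current docs pins down what position the caller should observe after the library call returns.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;

public class MixedApiStreamState {

    // Hypothetical library routine written against the old API's
    // Seekable contract: it repositions the stream as a side effect
    // and never restores the caller's position.
    static void legacyLibraryScan(Seekable s) throws IOException {
        s.seek(128);
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Open the file via the new API (FileContext)...
        FileContext fc = FileContext.getFileContext(conf);
        FSDataInputStream in = fc.open(new Path("/tmp/example.txt")); // hypothetical path

        in.seek(0);
        legacyLibraryScan(in); // ...then hand the stream to old-API-style code.

        // The underspecified part: must every compliant implementation
        // show the library's seek() here, or may a wrapper buffer and
        // track position independently?
        System.out.println("position after library call: " + in.getPos());
        in.close();
    }
}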
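And a sketch of what one HCK-style functional test could look like: a JUnit 4 test written against the public FileSystem contract, picking up the implementation under test from the configured default filesystem (fs.default.name in 2011-era releases), so the same suite can be pointed at HDFS or at a vendor's drop-in replacement. The class name, test paths, and choice of assertions are illustrative assumptions, not an agreed spec; the rename() behaviour asserted below matches what HDFS does today, which is exactly the kind of corner case a written specification would have to pin down.

import static org.junit.Assert.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class FileSystemContractSketch {

    // The filesystem under test comes from the default configuration,
    // so vendors run the identical suite against their implementation.
    private FileSystem getTestFileSystem() throws Exception {
        return FileSystem.get(new Configuration());
    }

    @Test
    public void createdFileIsVisibleAndHasExpectedLength() throws Exception {
        FileSystem fs = getTestFileSystem();
        Path p = new Path("/tmp/hck-sketch/file1"); // hypothetical test path
        FSDataOutputStream out = fs.create(p, true);
        out.write(new byte[] {1, 2, 3});
        out.close();
        assertTrue("file should be visible after close", fs.exists(p));
        assertEquals("length must match bytes written",
                3, fs.getFileStatus(p).getLen());
    }

    @Test
    public void renameOfMissingSourceReturnsFalse() throws Exception {
        FileSystem fs = getTestFileSystem();
        Path missing = new Path("/tmp/hck-sketch/no-such-file");
        // A corner case a written spec would nail down: on HDFS,
        // rename() of a nonexistent source returns false rather than
        // throwing; a compatible implementation should do the same.
        assertFalse(fs.rename(missing, new Path("/tmp/hck-sketch/dest")));
    }
}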
