On Thu, May 12, 2011 at 09:45, Milind Bhandarkar <[email protected]> wrote:
> HCK and written specifications are not mutually exclusive. However, given
> the evolving nature of Hadoop APIs, functional tests need to evolve as
I would actually expand it to 'functional and system tests', because the
latter are capable of validating inter-component interactions not covered
by functional tests.

Cos

> well, and having them tied to a "current stable" version is easier to do
> than it is to tie the written specifications.
>
> - milind
>
> --
> Milind Bhandarkar
> [email protected]
> +1-650-776-3167
>
>
> On 5/11/11 7:26 PM, "M. C. Srivas" <[email protected]> wrote:
>
>> While the HCK is a great idea to check quickly if an implementation is
>> "compliant", we still need a written specification to define what is
>> meant by compliance, something akin to a set of RFCs, or a set of docs
>> like the IEEE POSIX specifications.
>>
>> For example, the POSIX.1c pthreads API has a written document that
>> specifies all the function calls, input params, return values, and
>> error codes. It clearly indicates what any POSIX-compliant threads
>> package needs to support, and which vendor-specific non-portable
>> extensions one can use at one's own risk.
>>
>> Currently we have two sets of APIs in the DFS and Map/Reduce layers,
>> and the specification can be extracted only by looking at the code, or
>> (where the code is non-trivial) by writing really bizarre test programs
>> to examine corner cases. Further, the interaction between a mix of the
>> old and new APIs is not specified anywhere. Such specifications are
>> vitally important when implementing libraries like Cascading, Mahout,
>> etc. For example, an application might open a file using the new API
>> and pass that stream into a library that manipulates the stream using
>> some of the old API ... what is then the expected state of the stream
>> when the library call returns?
>>
>> Sanjay Radia @ Y! has already started specifying some of the DFS APIs
>> to nail such things down. There's a similar good effort in the
>> Map/Reduce and Avro spaces, but it seems to have stalled somewhat. We
>> should continue it.
>>
>> Doing such specs would be a great service to the community and the
>> users of Hadoop. It provides them:
>>   (a) clear-cut docs on how to use the Hadoop APIs
>>   (b) a wider choice of Hadoop implementations, by freeing them from
>>       vendor lock-in.
>>
>> Once we have such a specification, the HCK becomes meaningful (since
>> the HCK itself will be buggy initially).
>>
>>
>> On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar
>> <[email protected]> wrote:
>>
>>> I think it's time to separate out functional tests as a "Hadoop
>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under
>>> ASL 2.0. Then "certification" would mean "passes 100% of the HCK
>>> test suite."
>>>
>>> - milind
>>> --
>>> Milind Bhandarkar
>>> [email protected]
>>>
>>>
>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <[email protected]> wrote:
>>>
>>>> This is a really interesting topic! I completely agree that we need
>>>> to get ahead of this.
>>>>
>>>> I would be really interested in learning of any experience other
>>>> Apache projects, such as httpd or Tomcat, have with these issues.
>>>>
>>>> ---
>>>> E14 - typing on glass
>>>>
>>>> On May 10, 2011, at 6:31 AM, "Steve Loughran" <[email protected]>
>>>> wrote:
>>>>
>>>>> Back in Jan 2011, I started a discussion about how to define Apache
>>>>> Hadoop compatibility:
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%[email protected]%3E
>>>>>
>>>>> I am now reading the EMC HD "Enterprise Ready" Apache Hadoop
>>>>> datasheet:
>>>>>
>>>>> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>>>>
>>>>> It claims that their implementations are 100% compatible, even
>>>>> though the Enterprise edition uses a C filesystem. It also claims
>>>>> that both their software releases contain "Certified Stacks",
>>>>> without defining what Certified means, or who does the
>>>>> certification -- only that it is an improvement.
>>>>>
>>>>> I think we should revisit this issue before people with their own
>>>>> agendas define for us what compatibility with Apache Hadoop means.
>>>>>
>>>>> Licensing
>>>>> - Use of the Hadoop codebase must follow the Apache License:
>>>>>   http://www.apache.org/licenses/LICENSE-2.0
>>>>> - Plug-in components that are dynamically linked to (filesystems
>>>>>   and schedulers) don't appear to be derivative works on my reading
>>>>>   of this.
>>>>>
>>>>> Naming
>>>>> - This is something for branding@apache; they will have their
>>>>>   opinions. The key one is that the name "Apache Hadoop" must get
>>>>>   used, and it's important to make clear it is a derivative work.
>>>>> - I don't think you can claim to have a Distribution/Fork/Version
>>>>>   of Apache Hadoop if you swap out big chunks of it for alternate
>>>>>   filesystems, MR engines, etc. Some description of this is needed:
>>>>>   "Supports the Apache Hadoop MapReduce engine on top of Filesystem
>>>>>   XYZ".
>>>>>
>>>>> Compatibility
>>>>> - The definition of the Hadoop interfaces and classes is the Apache
>>>>>   source tree.
>>>>> - The definition of the semantics of the Hadoop interfaces and
>>>>>   classes is the Apache source tree, including the test classes.
>>>>> - The verification that the actual semantics of an Apache Hadoop
>>>>>   release are compatible with the expected semantics is that
>>>>>   current and future tests pass.
>>>>> - Bug reports can highlight incompatibility with the expectations
>>>>>   of community users, and once incorporated into tests they form
>>>>>   part of the compatibility testing.
>>>>> - Vendors can claim and even certify their derivative works as
>>>>>   compatible with other versions of their derivative works, but
>>>>>   they cannot claim compatibility with Apache Hadoop unless their
>>>>>   code passes the tests and is consistent with the bug reports
>>>>>   marked as "by design".
>>>>>   Perhaps we should have tests that verify each of these "by
>>>>>   design" bug reports to make them more formal.
>>>>>
>>>>> Certification
>>>>> - I have no idea what this means in EMC's case; they just say
>>>>>   "Certified".
>>>>> - As we don't do any certification ourselves, it would seem
>>>>>   impossible for us to certify that any derivative work is
>>>>>   compatible.
>>>>> - It may be best to state that nobody can certify their derivative
>>>>>   as "compatible with Apache Hadoop" unless it passes all current
>>>>>   test suites.
>>>>> - And require that anyone who declares compatibility define what
>>>>>   they mean by it.
>>>>>
>>>>> This is a good argument for getting more functional tests out
>>>>> there: whoever has more functional tests needs to get them into a
>>>>> test module that can be used to test real deployments.
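Srivas's question upthread -- what state is a stream left in after a library manipulates it through a different API layer? -- has a plain-JDK analogue that shows why a written spec matters. The sketch below is illustrative only (java.io, not Hadoop code, and all names are hypothetical): once a buffering wrapper sits between the library and the caller's stream, the underlying stream's position no longer matches what the caller believes has been consumed, and only a specification can say which behaviour is correct.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** JDK-only analogue of the "mixed API layers" question: after a library
 *  reads through its own wrapper, what state is the caller's stream in? */
public class StreamStateSketch {

    /** A "library" that wraps the caller's stream in its own buffered view
     *  and logically consumes just one byte. */
    static int libraryReadsOneByte(InputStream callerStream) throws IOException {
        BufferedInputStream wrapped = new BufferedInputStream(callerStream, 64);
        // Returns 1 byte to the library, but may pull up to 64 bytes
        // from callerStream into the wrapper's private buffer.
        return wrapped.read();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        InputStream caller = new ByteArrayInputStream(data);

        int seen = libraryReadsOneByte(caller); // library "consumed" one byte...
        int next = caller.read();               // ...but the caller's stream has
                                                // advanced by the whole buffer fill

        System.out.println("library saw byte " + seen
            + ", caller now reads byte " + next);
    }
}
```

Without a spec, neither the caller nor the library author can say whether the bytes silently swallowed by the intermediate layer are a bug or "by design" -- which is exactly the class of corner case the thread argues should be written down.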
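As a rough illustration of the "passes 100% of the HCK testsuite" idea discussed above, here is a minimal, self-contained contract-test sketch. All names in it (MiniFs, InMemoryFs, runContractTests) are hypothetical stand-ins, not the real Hadoop or HCK APIs; the point is only the shape -- each check encodes one documented behaviour of a specified interface, and "certified" means zero failures against a vendor's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of an HCK-style contract check: a vendor's
 *  implementation is "compatible" only if every check passes. */
public class HckSketch {

    /** Tiny stand-in for a specified API surface (not the real FileSystem). */
    interface MiniFs {
        void create(String path);
        boolean exists(String path);
        boolean delete(String path);
    }

    /** Reference implementation that a vendor might replace. */
    static class InMemoryFs implements MiniFs {
        private final List<String> paths = new ArrayList<>();
        public void create(String path) { if (!paths.contains(path)) paths.add(path); }
        public boolean exists(String path) { return paths.contains(path); }
        public boolean delete(String path) { return paths.remove(path); }
    }

    /** Each check is one sentence of the written spec turned into code. */
    static List<String> runContractTests(MiniFs fs) {
        List<String> failures = new ArrayList<>();
        fs.create("/a");
        if (!fs.exists("/a")) failures.add("create must make the path visible");
        if (!fs.delete("/a")) failures.add("delete of an existing path must return true");
        if (fs.exists("/a"))  failures.add("a deleted path must no longer exist");
        if (fs.delete("/a"))  failures.add("delete of a missing path must return false");
        return failures;
    }

    public static void main(String[] args) {
        List<String> failures = runContractTests(new InMemoryFs());
        System.out.println(failures.isEmpty()
            ? "PASS: 100% of contract tests"
            : "FAIL: " + failures);
    }
}
```

A real kit would of course be far larger, and -- as the thread notes -- the checks themselves only become authoritative once a written specification exists for them to encode.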
