Cos, can you give me an example of a "system test" that is not a functional test? My assumption was that the functionality being tested is specific to a component, and that inter-component interactions (that's what you meant, right?) would be taken care of by the public interface and semantics of a component's API.
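To make sure we are talking about the same thing, here is roughly the split I picture. These are sketches only -- the class, path, and helper names are made up for illustration and are not actual Hadoop or HCK test code:

import static org.junit.Assert.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class FunctionalVsSystemSketch {

  // Functional test (as I understand it): exercises one component's
  // public API contract in isolation -- here, the FileSystem API of
  // whatever fs.defaultFS points at.
  @Test
  public void createdFileIsVisibleWithCorrectLength() throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/tmp/hck-sketch/one.txt");
    FSDataOutputStream out = fs.create(p, true);
    out.write("hello".getBytes("UTF-8"));
    out.close();
    assertTrue(fs.exists(p));
    assertEquals(5L, fs.getFileStatus(p).getLen());
    fs.delete(p, false);
  }

  // System test (as I read your mail): validates an inter-component
  // interaction end to end, e.g. the MapReduce framework reading its
  // splits from and committing its output to the DFS. Deliberately
  // schematic here:
  @Test
  public void mapReduceJobRoundTripsThroughDfs() throws Exception {
    // 1. stage known input under /tmp/hck-sketch/input on the cluster fs
    // 2. submit a trivial identity job through the MapReduce client API
    // 3. assert that the committed output matches the staged input
    // A functional test of either component alone would not catch, say,
    // an output committer that misuses FileSystem.rename().
  }
}

Is the second kind what you are calling a system test?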
- milind

--
Milind Bhandarkar
[email protected]
+1-650-776-3167


On 5/12/11 3:30 PM, "Konstantin Boudnik" <[email protected]> wrote:

>On Thu, May 12, 2011 at 09:45, Milind Bhandarkar <[email protected]> wrote:
>> HCK and written specifications are not mutually exclusive. However, given
>> the evolving nature of Hadoop APIs, functional tests need to evolve as
>
>I would actually expand it to 'functional and system tests' because
>latter are capable of validating inter-component iterations not
>coverable by functional tests.
>
>Cos
>
>> well, and having them tied to a "current stable" version is easier to do
>> than it is to tie the written specifications.
>>
>> - milind
>>
>> --
>> Milind Bhandarkar
>> [email protected]
>> +1-650-776-3167
>>
>>
>> On 5/11/11 7:26 PM, "M. C. Srivas" <[email protected]> wrote:
>>
>>>While the HCK is a great idea to check quickly if an implementation is
>>>"compliant", we still need a written specification to define what is meant
>>>by compliance, something akin to a set of RFC's, or a set of docs like the
>>>IEEE POSIX specifications.
>>>
>>>For example, the POSIX.1c pthreads API has a written document that specifies
>>>all the function calls, input params, return values, and error codes. It
>>>clearly indicates what any POSIX-compliant threads package needs to support,
>>>and what are vendor-specific non-portable extensions that one can use at
>>>one's own risk.
>>>
>>>Currently we have 2 sets of API in the DFS and Map/Reduce layers, and the
>>>specification is extracted only by looking at the code, or (where the code
>>>is non-trivial) by writing really bizarre test programs to examine corner
>>>cases. Further, the interaction between a mix of the old and new APIs is not
>>>specified anywhere. Such specifications are vitally important when
>>>implementing libraries like Cascading, Mahout, etc. For example, an
>>>application might open a file using the new API, and pass that stream into a
>>>library that manipulates the stream using some of the old API ... what is
>>>then the expectation of the state of the stream when the library call
>>>returns?
>>>
>>>Sanjay Radia @ Y! already started specifying some of the DFS APIs to nail
>>>such things down. There's similar good effort in the Map/Reduce and Avro
>>>spaces, but it seems to have stalled somewhat. We should continue it.
>>>
>>>Doing such specs would be a great service to the community and the users of
>>>Hadoop. It provides them
>>> (a) clear-cut docs on how to use the Hadoop APIs
>>> (b) wider choice of Hadoop implementations by freeing them from vendor
>>>lock-in.
>>>
>>>Once we have such specification, the HCK becomes meaningful (since the HCK
>>>itself will be buggy initially).
>>>
>>>
>>>On Wed, May 11, 2011 at 2:46 PM, Milind Bhandarkar
>>><[email protected]> wrote:
>>>
>>>> I think it's time to separate out functional tests as a "Hadoop
>>>> Compatibility Kit (HCK)", similar to the Sun TCK for Java, but under ASL
>>>> 2.0. Then "certification" would mean "Passes 100% of the HCK testsuite."
>>>>
>>>> - milind
>>>> --
>>>> Milind Bhandarkar
>>>> [email protected]
>>>>
>>>>
>>>> On 5/11/11 2:24 PM, "Eric Baldeschwieler" <[email protected]> wrote:
>>>>
>>>> >This is a really interesting topic! I completely agree that we need to
>>>> >get ahead of this.
>>>> >
>>>> >I would be really interested in learning of any experience other apache
>>>> >projects, such as apache or tomcat have with these issues.
>>>> >
>>>> >---
>>>> >E14 - typing on glass
>>>> >
>>>> >On May 10, 2011, at 6:31 AM, "Steve Loughran" <[email protected]> wrote:
>>>> >
>>>> >>
>>>> >> Back in Jan 2011, I started a discussion about how to define Apache
>>>> >> Hadoop Compatibility:
>>>> >>
>>>> >> http://mail-archives.apache.org/mod_mbox/hadoop-general/201101.mbox/%3C4D[email protected]%3E
>>>> >>
>>>> >> I am now reading EMC HD "Enterprise Ready" Apache Hadoop datasheet
>>>> >>
>>>> >> http://www.greenplum.com/sites/default/files/EMC_Greenplum_HD_DS_Final_1.pdf
>>>> >>
>>>> >> It claims that their implementations are 100% compatible, even though
>>>> >> the Enterprise edition uses a C filesystem. It also claims that both
>>>> >> their software releases contain "Certified Stacks", without defining
>>>> >> what Certified means, or who does the certification -only that it is an
>>>> >> improvement.
>>>> >>
>>>> >>
>>>> >> I think we should revisit this issue before people with their own
>>>> >> agendas define what compatibility with Apache Hadoop is for us
>>>> >>
>>>> >>
>>>> >> Licensing
>>>> >> -Use of the Hadoop codebase must follow the Apache License
>>>> >> http://www.apache.org/licenses/LICENSE-2.0
>>>> >> -plug in components that are dynamically linked to (Filesystems and
>>>> >> schedulers) don't appear to be derivative works on my reading of this,
>>>> >>
>>>> >> Naming
>>>> >> -this is something for branding@apache, they will have their opinions.
>>>> >> The key one is that the name "Apache Hadoop" must get used, and it's
>>>> >> important to make clear it is a derivative work.
>>>> >> -I don't think you can claim to have a Distribution/Fork/Version of
>>>> >> Apache Hadoop if you swap out big chunks of it for alternate
>>>> >> filesystems, MR engines, etc. Some description of this is needed
>>>> >> "Supports the Apache Hadoop MapReduce engine on top of Filesystem XYZ"
>>>> >>
>>>> >> Compatibility
>>>> >> -the definition of the Hadoop interfaces and classes is the Apache
>>>> >> Source tree,
>>>> >> -the definition of semantics of the Hadoop interfaces and classes is
>>>> >> the Apache Source tree, including the test classes.
>>>> >> -the verification that the actual semantics of an Apache Hadoop
>>>> >> release is compatible with the expected semantics is that current and
>>>> >> future tests pass
>>>> >> -bug reports can highlight incompatibility with expectations of
>>>> >> community users, and once incorporated into tests form part of the
>>>> >> compatibility testing
>>>> >> -vendors can claim and even certify their derivative works as
>>>> >> compatible with other versions of their derivative works, but cannot
>>>> >> claim compatibility with Apache Hadoop unless their code passes the
>>>> >> tests and is consistent with the bug reports marked as ("by design").
>>>> >> Perhaps we should have tests that verify each of these "by design"
>>>> >> bugreps to make them more formal.
>>>> >>
>>>> >> Certification
>>>> >> -I have no idea what this means in EMC's case, they just say "Certified"
>>>> >> -As we don't do any certification ourselves, it would seem impossible
>>>> >> for us to certify that any derivative work is compatible.
>>>> >> -It may be best to state that nobody can certify their derivative as
>>>> >> "compatible with Apache Hadoop" unless it passes all current test suites
>>>> >> -And require that anyone who declares compatibility define what they
>>>> >> mean by this
>>>> >>
>>>> >> This is a good argument for getting more functional tests out there
>>>> >> -whoever has more functional tests needs to get them into a test module
>>>> >> that can be used to test real deployments.
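
PS: on Srivas's old-API/new-API example above, since it is exactly the kind of corner case a written spec plus the HCK would have to pin down, the scenario looks roughly like the sketch below. The library routine and the path are hypothetical -- only the FileContext/FSDataInputStream usage is meant to be real, and even that is just an illustration, not code from the Hadoop tree:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public class MixedApiStreamSketch {

  public static void main(String[] args) throws Exception {
    // Application code opens the file through the newer FileContext API.
    FileContext fc = FileContext.getFileContext(new Configuration());
    FSDataInputStream in = fc.open(new Path("/data/events.log"));

    in.seek(4096); // application positions the stream

    // Stand-in for a third-party library (Cascading, Mahout, ...) written
    // against older FileSystem-era idioms; it receives the same stream.
    legacyLibraryScan(in);

    // What is in.getPos() guaranteed to be here? Today the answer lives
    // only in the code; a written spec should state it explicitly.
    System.out.println("position after library call: " + in.getPos());
    in.close();
  }

  // Hypothetical library routine, not a real API.
  static void legacyLibraryScan(FSDataInputStream in) throws Exception {
    byte[] header = new byte[16];
    in.readFully(0, header);   // positioned read: does it move the cursor?
    byte[] chunk = new byte[1024];
    in.read(chunk);            // relative read: starting from where?
  }
}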
