Steve,

I'm interested in the "sampling for testing" direction; being able to "fuzz
test" your MR job over a subset of the input sounds like a really good idea.
Figuring out how to handle InputFormat/serialization issues for records
extracted this way is likely to be a big challenge there, though not an
insurmountable one. The other interesting challenge is: how do you write an
output specification for that sampled-input run that can be easily verified
as correct?
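
To make the sampling step concrete, here is a minimal sketch of drawing a
fixed-size uniform sample from a record stream of unknown length (reservoir
sampling). It is not MRUnit or Hadoop API: the class name ReservoirSampler is
hypothetical, and it assumes the records have already been deserialized, so
it sidesteps the InputFormat/serialization question entirely. Passing in a
seeded Random makes the sample repeatable, which is one possible way to keep
an expected-output specification stable between runs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper for illustration; not part of MRUnit or Hadoop. */
public final class ReservoirSampler {
    private ReservoirSampler() {}

    /**
     * Returns up to k records chosen uniformly at random from the stream.
     * If the stream holds fewer than k records, all of them are returned.
     */
    public static <T> List<T> sample(Iterator<T> records, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k records.
                reservoir.add(record);
            } else {
                // Keep the new record with probability k / seen by picking
                // a slot in [0, seen) and replacing only if it falls in [0, k).
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }
}
```

Something like ReservoirSampler.sample(lines.iterator(), 100, new Random(42L))
would then give the same 100 records on every run over the same input, which
could feed a Mapper or Reducer test.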

I'm a bit more confused about the topicality of the "distributing JUnit
tests" problem. In my mind, the direction MRUnit should pursue is to make it
easier to test MapReduce jobs, and more specifically, their constituent
components (currently, Mappers and Reducers). The ability to parallelize
test execution of generic JUnit tests across a number of machines is a
pretty different problem. Can you explain where something like
Hudson/Jenkins, which already has the ability to aggregate parallel testing
workloads, isn't the right choice, and what we could do better/differently?

Cheers,
- Aaron


On Fri, May 27, 2011 at 4:36 PM, Patrick Hunt <[email protected]> wrote:

> Even subprojects are considered separate communities (at least that's
> my understanding of it). In general Apache frowns on subs. I believe
> another option is to have the code base as a separate repo from the
> tlp, but still part of the tlp, with separate dev/release cycles but a
> single "community". This is the ideal, not sure how it reconciles with
> the real world.
>
> Patrick
>
> On Thu, May 26, 2011 at 11:04 AM, Eric Sammer <[email protected]> wrote:
> > And I think that leads in to the conversation about where mrunit goes when
> > we graduate. The original purpose of a breakout was mostly to allow separate
> > release cycles and to be able to support multiple versions of Hadoop (if we
> > wanted to do such things) without having circular dependencies across
> > versions. If we graduate to a standalone subproject of Hadoop (which may be
> > an option, subject to the Hadoop PMC's approval) we could "reunite" the
> > communities while still remaining independent. Just a thought.
> >
> > On Thu, May 26, 2011 at 9:58 AM, Patrick Hunt <[email protected]> wrote:
> >
> >> I had suggested something like this in one of the original "remove
> >> MRUNIT from hadoop contrib" threads... There was some push back about
> >> community fragmentation (tests should live in hadoop), but I
> >> personally don't see why not, we could course correct as things
> >> mature.
> >>
> >> On Thu, May 26, 2011 at 3:35 AM, Steve Loughran <[email protected]> wrote:
> >> > I'm thinking, could MRUnit be the place to put in other hadoop-testing code.
> >> >
> >> > specifically
> >> >
> >> > == Junit on multiple hosts ==
> >> >
> >> >
> >> > I have some prototype code to exec junit test cases as MR jobs, collect the
> >> > results (including serialized throwables). It runs one test per line of text
> >> > (the name of the package). It could be better to support lines of tests and
> >> > config options, or other ways to explore the config space. And I'd really
> >> > like to be able to deploy the junit tests to all the workers in the cluster,
> >> > the reduction would be to identify which boxes are playing up.
> >> >
> >> > == Sampling for testing ==
> >> >
> >> > Good desktop tests need real data, which means sampling from the live
> >> > datasets. Some standard MR jobs to do the sampling (which themselves use
> >> > MR Unit to self-test) could make it easier to sample.
> >> >
> >> > thoughts?
> >> >
> >>
> >
> >
> >
> > --
> > Eric Sammer
> > twitter: esammer
> > data: www.cloudera.com
> >
>
