Thanks Wes for doing the coordination.
This looks great to me.

I support the need for unit tests and individually testable components.
This is fairly standard, and in my experience it is simply more productive to
work that way because you can iterate faster.
For example, in parquet-mr the following things are tested independently:
 - the encodings without the file format.
 - the file format (footer, row groups, pages) without the encodings.
 - the assembly algorithm without the encodings or anything else below.
 - individual model conversion without the file format.
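
To make the decoupling concrete, here is a minimal C++ sketch of a column
reader that only knows how to ask for its next page, so it can be tested
against in-memory pages with no file format involved. Every name in it
(Page, PageSource, InMemoryPageSource, ColumnReader, ReadAll) is hypothetical
and chosen for illustration; this is not the parquet-cpp or parquet-mr API,
and the pages hold already-decoded values to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical types for illustration only (not the parquet-cpp API).
struct Page {
  std::vector<int32_t> values;  // already-decoded values, to keep this small
};

// The only thing a reader knows about storage: how to ask for the next page.
class PageSource {
 public:
  virtual ~PageSource() = default;
  // Returns nullptr when there are no more pages.
  virtual const Page* NextPage() = 0;
};

// Test fixture: serves pages from memory, so no file, footer or row group
// is needed to exercise the reader.
class InMemoryPageSource : public PageSource {
 public:
  explicit InMemoryPageSource(std::vector<Page> pages)
      : pages_(std::move(pages)) {}

  const Page* NextPage() override {
    if (next_ >= pages_.size()) return nullptr;
    return &pages_[next_++];
  }

 private:
  std::vector<Page> pages_;
  std::size_t next_ = 0;
};

// Reads values page by page; it never learns where pages physically live.
class ColumnReader {
 public:
  explicit ColumnReader(PageSource* source) : source_(source) {}

  // Writes the next value to *out; returns false at the end of the column.
  bool Next(int32_t* out) {
    // Fetch pages until one has an unread value (this skips empty pages).
    while (current_ == nullptr || pos_ >= current_->values.size()) {
      current_ = source_->NextPage();
      pos_ = 0;
      if (current_ == nullptr) return false;
    }
    *out = current_->values[pos_++];
    return true;
  }

 private:
  PageSource* source_;
  const Page* current_ = nullptr;
  std::size_t pos_ = 0;
};

// Convenience helper for tests: drain a source into a vector.
std::vector<int32_t> ReadAll(PageSource* source) {
  ColumnReader reader(source);
  std::vector<int32_t> out;
  int32_t value = 0;
  while (reader.Next(&value)) out.push_back(value);
  return out;
}
```

A test then builds an InMemoryPageSource from a handful of pages (including
an empty one) and checks what ReadAll returns. A file-backed PageSource can
be tested separately against the same interface.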

Of course we also test everything together from the MapReduce standpoint,
but that comes last (and those tests are a little slow because of MapReduce).

I would also stress that individual unit tests should be fast. That means
unit tests run on small-scale data that exercises corner cases.
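
As a rough illustration of what "small data, corner cases" can look like,
here is a C++ sketch of a round-trip check for a toy run-length encoding.
The encoding is a simplified stand-in I made up for the example (plain
count/value pairs), not Parquet's actual RLE/bit-packing hybrid, and the
names RleEncode, RleDecode and RoundTrips are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Toy run-length encoding for illustration: (repeat count, value) pairs.
// This is NOT the Parquet RLE/bit-packing hybrid format.
using Run = std::pair<uint32_t, int32_t>;

std::vector<Run> RleEncode(const std::vector<int32_t>& values) {
  std::vector<Run> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().second == v) {
      ++runs.back().first;       // extend the current run
    } else {
      runs.emplace_back(1u, v);  // start a new run of length 1
    }
  }
  return runs;
}

std::vector<int32_t> RleDecode(const std::vector<Run>& runs) {
  std::vector<int32_t> values;
  for (const Run& run : runs) {
    // Append run.first copies of run.second.
    values.insert(values.end(), run.first, run.second);
  }
  return values;
}

// True if decoding the encoded input gives back the input unchanged.
bool RoundTrips(const std::vector<int32_t>& values) {
  return RleDecode(RleEncode(values)) == values;
}

// Tiny inputs, each aimed at one corner case; no file is ever created.
bool CornerCasesPass() {
  const int32_t lo = std::numeric_limits<int32_t>::min();
  const int32_t hi = std::numeric_limits<int32_t>::max();
  return RoundTrips({})                 // empty column
      && RoundTrips({42})               // single value
      && RoundTrips({7, 7, 7, 7})       // one long run
      && RoundTrips({1, 2, 1, 2})       // no runs at all
      && RoundTrips({lo, lo, hi, hi});  // extreme values
}
```

Each case is a handful of values, so the whole check runs in well under a
millisecond and can be part of every build.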

This is a great doc, Wes. Could you add me as a commenter?

On Sun, Jan 31, 2016 at 12:11 PM, Wes McKinney <[email protected]> wrote:

> Dear all,
>
> I created a publicly available document where we can organize the
> parquet-cpp roadmap and outstanding JIRAs. I tried to organize all of the
> open JIRAs by functional component. Since there are about 40 open JIRAs now
> (and this will continue to balloon as we make progress) this seems like a
> good way to stay on the same page.
>
>
> https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit#
>
> Please request edit access and I will add you -- anyone can view (but not
> edit) the document.
>
> I stress that it is going to be extremely difficult for us to move forward
> in parallel without stopping to invest in unit test infrastructure and
> designing every component in a way that it can be tested in isolation. I've
> begun doing this for the primitive column readers in
> https://github.com/apache/parquet-cpp/pull/32, but it's a bare minimum
> effort to be able to write tests for the work that's been done the last two
> weeks.
>
> Thank you,
> Wes
>
> On Fri, Jan 29, 2016 at 10:48 AM, Julien Le Dem <[email protected]> wrote:
>
> > Sounds good to me.
> > At some point (later) we'll have to do some cross compatibility testing
> > with parquet-mr as well to make sure everything is on the same page.
> > CC'ing some folks who should probably chime in.
> >
> >
> > On Fri, Jan 29, 2016 at 10:21 AM, Wes McKinney <[email protected]> wrote:
> >
> > > hi folks,
> > >
> > > Since there are so many moving pieces in creating a full-featured Parquet
> > > reader-writer, I propose we start planning how to create test
> > > fixtures and tools to enable us to develop faster.
> > >
> > > Specifically, we need to achieve maximum decoupling between functional
> > > components. Every unit of functionality should be testable without having
> > > to create actual valid Parquet test data files. Smoke tests on real data
> > > will help, but it's a band-aid solution vs approaching the problem from a
> > > rigorous test-driven perspective.
> > >
> > > To assist with the discussion, let's address the different parts of the
> > > testing process
> > >
> > > - Functional unit testing of decoupled components. We need to make a
> > > diagram of all those boxes and what is their interface with each other. For
> > > example: a column decoder only needs to know how to ask for its next data
> > > page, but not where the data page is located physically.
> > >
> > > - Integration / macro-level testing, i.e. the "everything works together"
> > > part of the problem.
> > >
> > > I don't think investing in much top-down / integration testing of the
> > > library will help us (and may actually actively hurt us) until we organize
> > > the functional components of the library in a way that everything can be
> > > tested easily in isolation.
> > >
> > > I propose that we use a Google document to help with this design process
> > > and we can learn from parquet-mr and other implementations of Parquet to
> > > help move things along. In doing this we can cross-reference existing and
> > > new JIRAs so that it's clear exactly what needs to be done for each part of
> > > the system.
> > >
> > > Let me know your thoughts.
> > >
> > > thanks,
> > > Wes
> > >
> >
> >
> >
> > --
> > Julien
> >
>



-- 
Julien
