Sounds good to me (especially keeping the unit test suite fast).

Added you as a commenter to the doc. I will endeavor to keep it updated as
new JIRAs come in.

On Sun, Jan 31, 2016 at 7:03 PM, Julien Le Dem <[email protected]> wrote:

> Thanks Wes for doing the coordination.
> This looks great to me.
>
> I support the need for unit tests and individually testable components.
> This is fairly standard, and in my experience it is just more productive to
> do things that way as you can iterate faster.
> For example in parquet-mr the following things are tested independently:
>  - the encodings without the file format
>  - the file format (footer, row groups, pages) without the encodings.
>  - the assembly algorithm without the encodings or anything else below.
>  - individual model conversion without the file format.
>
> Of course we also test everything together from the map reduce standpoint
> but that comes last (and those tests are a little slow because of mr).
>
> I would also stress that individual unit tests should be fast. That means
> unit tests run on small scale data that exercises corner cases.
>
> This is a great doc, Wes. Could you add me as a commenter?
>
> On Sun, Jan 31, 2016 at 12:11 PM, Wes McKinney <[email protected]> wrote:
>
> > Dear all,
> >
> > I created a publicly available document where we can organize the
> > parquet-cpp roadmap and outstanding JIRAs. I tried to organize all of the
> > open JIRAs by functional component. Since there are about 40 open JIRAs now
> > (and this will continue to balloon as we make progress) this seems like a
> > good way to stay on the same page.
> >
> >
> > https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit#
> >
> > Please request edit access and I will add you -- anyone can view (but not
> > edit) the document.
> >
> > I stress that it is going to be extremely difficult for us to move forward
> > in parallel without stopping to invest in unit test infrastructure and
> > designing every component in a way that it can be tested in isolation. I've
> > begun doing this for the primitive column readers in
> > https://github.com/apache/parquet-cpp/pull/32, but it's a bare minimum
> > effort to be able to write tests for the work that's been done over the
> > last two weeks.
> >
> > Thank you,
> > Wes
> >
> > On Fri, Jan 29, 2016 at 10:48 AM, Julien Le Dem <[email protected]> wrote:
> >
> > > Sounds good to me.
> > > At some point (later) we'll have to do some cross-compatibility testing
> > > with parquet-mr as well to make sure everything is on the same page.
> > > CC'ing some folks who should probably chime in.
> > >
> > >
> > > On Fri, Jan 29, 2016 at 10:21 AM, Wes McKinney <[email protected]> wrote:
> > >
> > > > hi folks,
> > > >
> > > > Since there are so many moving pieces in creating a full-featured Parquet
> > > > reader-writer, I propose we start putting together a plan to create test
> > > > fixtures and tools to enable us to develop faster.
> > > >
> > > > Specifically, we need to achieve maximum decoupling between functional
> > > > components. Every unit of functionality should be testable without having
> > > > to create actual valid Parquet test data files. Smoke tests on real data
> > > > will help, but it's a band-aid solution vs approaching the problem from a
> > > > rigorous test-driven perspective.
> > > >
> > > > To assist with the discussion, let's address the different parts of the
> > > > testing process:
> > > >
> > > > - Functional unit testing of decoupled components. We need to make a
> > > > diagram of all those boxes and how they interface with each other. For
> > > > example: a column decoder only needs to know how to ask for its next data
> > > > page, but not where the data page is located physically.
> > > >
> > > > - Integration / macro-level testing, i.e. the "everything works together"
> > > > part of the problem.
> > > >
> > > > I don't think investing in much top-down / integration testing of the
> > > > library will help us (and may actually actively hurt us) until we organize
> > > > the functional components of the library in a way that everything can be
> > > > tested easily in isolation.
> > > >
> > > > I propose that we use a Google document to help with this design process
> > > > and we can learn from parquet-mr and other implementations of Parquet to
> > > > help move things along. In doing this we can cross-reference existing and
> > > > new JIRAs so that it's clear exactly what needs to be done for each part of
> > > > the system.
> > > >
> > > > Let me know your thoughts.
> > > >
> > > > thanks,
> > > > Wes
> > > >
> > >
> > >
> > >
> > > --
> > > Julien
> > >
> >
>
>
>
> --
> Julien
>

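For illustration only, here is a minimal C++ sketch of the decoupling described in the thread: a column decoder that only knows how to ask for its next data page, not where that page lives. Every name here (DataPage, PageSource, InMemoryPageSource, Int32ColumnDecoder, MakeInt32Page) is hypothetical and is not taken from parquet-cpp. The page layout (raw int32 values in native byte order) is likewise an assumption made to keep the example small, not the real Parquet encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical page type: raw int32 values in native byte order (assumed layout).
struct DataPage {
  std::vector<uint8_t> bytes;
};

// Hypothetical interface: the only thing a decoder needs from the outside world.
class PageSource {
 public:
  virtual ~PageSource() = default;
  // Returns the next page, or nullptr once the source is exhausted.
  virtual std::unique_ptr<DataPage> NextPage() = 0;
};

// Test fixture: serves pages from memory, so no Parquet file is required.
class InMemoryPageSource : public PageSource {
 public:
  explicit InMemoryPageSource(std::vector<DataPage> pages)
      : pages_(std::move(pages)) {}

  std::unique_ptr<DataPage> NextPage() override {
    if (pos_ >= pages_.size()) return nullptr;
    return std::make_unique<DataPage>(std::move(pages_[pos_++]));
  }

 private:
  std::vector<DataPage> pages_;
  std::size_t pos_ = 0;
};

// Decoder that depends only on PageSource, never on files or offsets.
class Int32ColumnDecoder {
 public:
  explicit Int32ColumnDecoder(PageSource* source) : source_(source) {}

  // Drains the source and returns every decoded value in page order.
  std::vector<int32_t> ReadAll() {
    std::vector<int32_t> out;
    while (auto page = source_->NextPage()) {
      const std::vector<uint8_t>& b = page->bytes;
      for (std::size_t i = 0; i + sizeof(int32_t) <= b.size();
           i += sizeof(int32_t)) {
        int32_t value;
        std::memcpy(&value, b.data() + i, sizeof(value));
        out.push_back(value);
      }
    }
    return out;
  }

 private:
  PageSource* source_;  // not owned
};

// Helper for tests: packs values into a page using the assumed layout.
DataPage MakeInt32Page(const std::vector<int32_t>& values) {
  DataPage page;
  page.bytes.resize(values.size() * sizeof(int32_t));
  if (!values.empty()) {
    std::memcpy(page.bytes.data(), values.data(), page.bytes.size());
  }
  return page;
}
```

A unit test under these assumptions would build a few small pages with MakeInt32Page (including an empty one as a corner case), hand them to an InMemoryPageSource, and check what Int32ColumnDecoder::ReadAll returns. A file-backed PageSource could be tested separately, without involving the decoder.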