Sounds good to me (especially keeping the unit test suite fast). Added you as a commenter to the doc. I will endeavor to keep it updated as new JIRAs come in.
On Sun, Jan 31, 2016 at 7:03 PM, Julien Le Dem <[email protected]> wrote:

> Thanks Wes for doing the coordination.
> This looks great to me.
>
> I support the need for unit tests and individually testable components.
> This is fairly standard and in my experience it is just more productive to
> do things that way as you can iterate faster.
> For example in parquet-mr the following things are tested independently:
> - the encodings without the file format
> - the file format (footer, row groups, pages) without the encodings.
> - the assembly algorithm without the encodings or anything else below.
> - individual model conversion without the file format.
>
> Of course we also test everything together from the map reduce standpoint
> but that comes last (and those tests are a little slow because of mr).
>
> I would also stress that individual unit tests should be fast. That means
> unit tests run on small scale data that exercises corner cases.
>
> This is a great doc Wes. Could you add me as a commenter?
>
> On Sun, Jan 31, 2016 at 12:11 PM, Wes McKinney <[email protected]> wrote:
>
> > Dear all,
> >
> > I created a publicly available document where we can organize the
> > parquet-cpp roadmap and outstanding JIRAs. I tried to organize all of the
> > open JIRAs by functional component. Since there are about 40 open JIRAs now
> > (and this will continue to balloon as we make progress) this seems like a
> > good way to stay on the same page.
> >
> > https://docs.google.com/document/d/1WyquzupLc3UkErO2OhqLJNQ9a84Cccc8LVUSuLQz39o/edit#
> >
> > Please request edit access and I will add you -- anyone can view (but not
> > edit) the document.
> >
> > I stress that it is going to be extremely difficult for us to move forward
> > in parallel without stopping to invest in unit test infrastructure and
> > designing every component in a way that it can be tested in isolation. I've
> > begun doing this for the primitive column readers in
> > https://github.com/apache/parquet-cpp/pull/32, but it's a bare minimum
> > effort to be able to write tests for the work that's been done the last two
> > weeks.
> >
> > Thank you,
> > Wes
> >
> > On Fri, Jan 29, 2016 at 10:48 AM, Julien Le Dem <[email protected]> wrote:
> >
> > > Sounds good to me.
> > > At some point (later) we'll have to do some cross compatibility testing
> > > with parquet-mr as well to make sure everything is on the same page.
> > > CC'ing some folks who should probably chime in.
> > >
> > > On Fri, Jan 29, 2016 at 10:21 AM, Wes McKinney <[email protected]> wrote:
> > >
> > > > hi folks,
> > > >
> > > > Since there are so many moving pieces with creating a full-featured Parquet
> > > > reader-writer, I propose we start planning how to create test
> > > > fixtures and tools to enable us to develop faster.
> > > >
> > > > Specifically, we need to achieve maximum decoupling between functional
> > > > components. Every unit of functionality should be testable without having
> > > > to create actual valid Parquet test data files. Smoke tests on real data
> > > > will help, but it's a band-aid solution vs approaching the problem from a
> > > > rigorous test-driven perspective.
> > > >
> > > > To assist with the discussion, let's address the different parts of the
> > > > testing process
> > > >
> > > > - Functional unit testing of decoupled components. We need to make a
> > > > diagram of all those boxes and what their interfaces with each other are.
> > > > For example: a column decoder only needs to know how to ask for its next
> > > > data page, but not where the data page is located physically.
> > > >
> > > > - Integration / macro-level testing, i.e. the "everything works together"
> > > > part of the problem.
> > > >
> > > > I don't think investing in much top-down / integration testing of the
> > > > library will help us (and may actually actively hurt us) until we organize
> > > > the functional components of the library in a way that everything can be
> > > > tested easily in isolation.
> > > >
> > > > I propose that we use a Google document to help with this design process
> > > > and we can learn from parquet-mr and other implementations of Parquet to
> > > > help move things along. In doing this we can cross-reference existing and
> > > > new JIRAs so that it's clear exactly what needs to be done for each part of
> > > > the system.
> > > >
> > > > Let me know your thoughts.
> > > >
> > > > thanks,
> > > > Wes
> > >
> > > --
> > > Julien
>
> --
> Julien
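
A rough sketch of the decoupling described in the original proposal, where a column decoder only asks for its next data page and does not know where that page physically lives. All names here (`DataPage`, `PageSource`, `InMemoryPageSource`, `Int32ColumnDecoder`, `MakeInt32Page`, `DecodeAll`) are hypothetical and are not taken from parquet-cpp, and the page payload is assumed to be raw host-byte-order int32 values rather than a real Parquet encoding.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical page type: just a byte payload (assumed raw int32 values).
struct DataPage {
  std::vector<uint8_t> bytes;
};

// The only thing a decoder knows about: "give me the next page".
class PageSource {
 public:
  virtual ~PageSource() = default;
  // Fills *out and returns true, or returns false when no pages remain.
  virtual bool NextPage(DataPage* out) = 0;
};

// Test double: serves pages from memory, so no Parquet file is needed.
class InMemoryPageSource : public PageSource {
 public:
  explicit InMemoryPageSource(std::deque<DataPage> pages)
      : pages_(std::move(pages)) {}

  bool NextPage(DataPage* out) override {
    if (pages_.empty()) return false;
    *out = std::move(pages_.front());
    pages_.pop_front();
    return true;
  }

 private:
  std::deque<DataPage> pages_;
};

// Decoder under test: depends on the PageSource interface only.
class Int32ColumnDecoder {
 public:
  explicit Int32ColumnDecoder(PageSource* source) : source_(source) {}

  // Writes the next value to *out; returns false when the column is exhausted.
  bool Next(int32_t* out) {
    // Fetch pages until one has a full value left (this also skips empty pages).
    while (offset_ + sizeof(int32_t) > page_.bytes.size()) {
      if (!source_->NextPage(&page_)) return false;
      offset_ = 0;
    }
    std::memcpy(out, page_.bytes.data() + offset_, sizeof(int32_t));
    offset_ += sizeof(int32_t);
    return true;
  }

 private:
  PageSource* source_;
  DataPage page_;
  std::size_t offset_ = 0;
};

// Test helper: builds a page from values using the same assumed layout.
DataPage MakeInt32Page(const std::vector<int32_t>& values) {
  DataPage page;
  page.bytes.resize(values.size() * sizeof(int32_t));
  if (!values.empty()) {
    std::memcpy(page.bytes.data(), values.data(), page.bytes.size());
  }
  return page;
}

// Drains a source through the decoder and collects every value.
std::vector<int32_t> DecodeAll(PageSource* source) {
  Int32ColumnDecoder decoder(source);
  std::vector<int32_t> values;
  int32_t value = 0;
  while (decoder.Next(&value)) values.push_back(value);
  return values;
}
```

With this shape, a unit test can hand the decoder a few hand-built pages (including an empty one as a corner case) through `InMemoryPageSource` and check the decoded values, without any file I/O; a file-backed source would be a separate component with its own tests.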
