hi folks,

Since there are so many moving pieces in creating a full-featured Parquet
reader-writer, I propose we start planning test fixtures and tools that
will enable us to develop faster.

Specifically, we need to achieve maximum decoupling between functional
components. Every unit of functionality should be testable without having
to create actual valid Parquet test data files. Smoke tests on real data
will help, but they are a band-aid compared with approaching the problem
from a rigorous test-driven perspective.

To assist with the discussion, let's address the different parts of the
testing process:

- Functional unit testing of decoupled components. We need to make a
diagram of all those boxes and how they interface with each other. For
example: a column decoder only needs to know how to ask for its next data
page, but not where the data page is physically located.

- Integration / macro-level testing, i.e. the "everything works together"
part of the problem.
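
To make the column decoder example concrete, here is a rough sketch of the
kind of seam I have in mind. It is written in Python purely for brevity, and
every name in it (PageSource, InMemoryPageSource, ColumnDecoder) is
hypothetical, not an existing API. Pages are modeled as plain lists of
integers rather than real encoded Parquet pages, so this only illustrates
the decoupling, not actual decoding.

```python
from typing import Iterator, List, Optional, Protocol


class PageSource(Protocol):
    """Hypothetical interface: hands out data pages one at a time."""

    def next_page(self) -> Optional[List[int]]:
        ...


class InMemoryPageSource:
    """Test fixture: serves pre-built pages from memory, no file needed."""

    def __init__(self, pages: List[List[int]]) -> None:
        self._pages = iter(pages)

    def next_page(self) -> Optional[List[int]]:
        # Returns None once all pages have been handed out.
        return next(self._pages, None)


class ColumnDecoder:
    """Yields values by asking its source for the next page.

    It never learns where the page physically lives (file, memory, ...).
    """

    def __init__(self, source: PageSource) -> None:
        self._source = source

    def values(self) -> Iterator[int]:
        while True:
            page = self._source.next_page()
            if page is None:
                return
            yield from page


# A unit test can drive the decoder with hand-built pages, including an
# empty page, without creating a valid Parquet file.
decoder = ColumnDecoder(InMemoryPageSource([[1, 2], [], [3]]))
decoded = list(decoder.values())
```

A file-backed page source would implement the same next_page method, so
the decoder and its tests would not need to change when one is added.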

I don't think investing much in top-down / integration testing of the
library will help us (and it may actively hurt us) until we organize
the functional components of the library so that everything can be
tested easily in isolation.

I propose that we use a Google document to help with this design process
and we can learn from parquet-mr and other implementations of Parquet to
help move things along. In doing this we can cross-reference existing and
new JIRAs so that it's clear exactly what needs to be done for each part of
the system.

Let me know your thoughts.

thanks,
Wes
