Hi folks,

Since there are so many moving pieces in creating a full-featured Parquet reader-writer, I propose we start planning test fixtures and tools that will let us develop faster.
Specifically, we need to achieve maximum decoupling between functional components. Every unit of functionality should be testable without having to create actual valid Parquet test data files. Smoke tests on real data will help, but they are a band-aid solution compared with approaching the problem from a rigorous test-driven perspective.

To assist with the discussion, let's address the different parts of the testing process:

- Functional unit testing of decoupled components. We need to make a diagram of all those boxes and their interfaces with each other. For example: a column decoder only needs to know how to ask for its next data page, not where the data page is physically located.

- Integration / macro-level testing, i.e. the "everything works together" part of the problem.

I don't think investing in much top-down / integration testing of the library will help us (and it may actively hurt us) until we organize the functional components of the library so that everything can be tested easily in isolation.

I propose that we use a Google document to help with this design process; we can learn from parquet-mr and other implementations of Parquet to help move things along. In doing this we can cross-reference existing and new JIRAs so that it's clear exactly what needs to be done for each part of the system.

Let me know your thoughts.

Thanks,
Wes
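
P.S. To make the column decoder example concrete, here is a rough sketch of the kind of decoupling I have in mind. It is written in Python only for brevity and is not a proposed API: the names PageSource, InMemoryPageSource, and PlainInt32ColumnDecoder are hypothetical, and the toy "page" format (packed little-endian 32-bit integers) is an assumption for illustration, not the real Parquet data page layout. The point is only that the decoder asks for its next page and can be tested against an in-memory fixture with no Parquet file involved.

```python
import struct
from typing import Iterator, List, Optional, Protocol


class PageSource(Protocol):
    """Hypothetical interface: hands out raw data pages one at a time."""

    def next_page(self) -> Optional[bytes]:
        """Return the next page, or None when there are no more pages."""
        ...


class InMemoryPageSource:
    """Test fixture: serves pages from a list instead of a file on disk."""

    def __init__(self, pages: List[bytes]) -> None:
        self._pages = iter(pages)

    def next_page(self) -> Optional[bytes]:
        return next(self._pages, None)


class PlainInt32ColumnDecoder:
    """Toy decoder: treats each page as packed little-endian int32 values.

    This is an illustrative stand-in, not the actual Parquet encoding.
    """

    def __init__(self, source: PageSource) -> None:
        # The decoder only knows how to ask for pages, not where they live.
        self._source = source

    def values(self) -> Iterator[int]:
        while True:
            page = self._source.next_page()
            if page is None:
                return
            for (value,) in struct.iter_unpack("<i", page):
                yield value


# Unit test with hand-built pages; no valid Parquet file is required.
pages = [struct.pack("<3i", 1, 2, 3), struct.pack("<2i", -4, 5)]
decoder = PlainInt32ColumnDecoder(InMemoryPageSource(pages))
decoded = list(decoder.values())
assert decoded == [1, 2, 3, -4, 5]
```

A file-backed page source would implement the same next_page method, so the decoder and its tests would not need to change when the physical storage does.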
