Hi,

In the last Parquet sync we were asked to share details of the testing we had done to validate the correctness of column indexes. We were also asked to validate the column index structures of a randomly generated data set against their contracts. Please see a summary of our efforts below:
- TestColumnIndexBuilder is part of the code base. It is a low-level test for the ColumnIndexBuilder class, which is used on both the write path and the read path. The test uses in-memory data to assert the correctness of min and max values, null counts, and filtering based on them.
- TestColumnIndexFilter is part of the code base as well. It is a low-level test for the ColumnIndexFilter class, which is used on the read path only. The test uses in-memory data to assert the correctness of filtering based on indexes.
- TestColumnIndexFiltering is the third major test in the code base. It is a higher-level test that writes and reads actual files and asserts correct filtering end-to-end (from writing to reading).
- We also acted on the new request to validate the contract end-to-end. This validation is not part of the code base, as it was not requested when the feature was merged, only during the release vote. Because we would not like to delay the release further by polishing this test and taking it through the review process, we will add it to the Parquet code base later (the same checks are already covered at a lower level in TestColumnIndexBuilder). We have already run these new tests, though, and validated that the indexes fulfill their contracts. The fully functional but unpolished test code can be checked here: https://github.com/zivanfi/parquet-mr/commit/0e74b4207daa0b53cf4bd0774d2632c388845cb9
- Finally, we ran our internal interoperability test suite. This suite drives Parquet through Hive, Impala and Spark: it writes data using each of these engines and reads it back in every combination, covering various use cases. In the Parquet sync I reported that we were still working on getting Spark to work with our test suite on the Parquet release candidate.
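For illustration, the kind of per-page contract check described above (every non-null value lies within the recorded min/max bounds, and the null count is exact) can be sketched roughly as follows. Note that PageStats and fulfillsContract are simplified stand-ins invented for this sketch; they are not the actual parquet-mr ColumnIndexBuilder API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/**
 * Simplified sketch of a column index contract check. PageStats is a
 * hypothetical stand-in for one per-page entry of a column index
 * (min, max, null count); the real structures live in parquet-mr.
 */
public class ColumnIndexContractSketch {

  /** Hypothetical per-page statistics as stored in a column index. */
  static final class PageStats {
    final Integer min;    // null would indicate an all-null page
    final Integer max;
    final long nullCount;
    PageStats(Integer min, Integer max, long nullCount) {
      this.min = min;
      this.max = max;
      this.nullCount = nullCount;
    }
  }

  /** Checks that the stats of one page fulfill the contract w.r.t. its values. */
  static boolean fulfillsContract(PageStats stats, List<Integer> pageValues) {
    // The recorded null count must match the actual number of nulls exactly.
    long nulls = pageValues.stream().filter(Objects::isNull).count();
    if (nulls != stats.nullCount) {
      return false;
    }
    for (Integer v : pageValues) {
      if (v == null) {
        continue;
      }
      // Every non-null value must lie within [min, max].
      if (stats.min == null || stats.max == null || v < stats.min || v > stats.max) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Integer> page = Arrays.asList(3, null, 7, 5);
    System.out.println(fulfillsContract(new PageStats(3, 7, 1), page)); // bounds and null count hold
    System.out.println(fulfillsContract(new PageStats(4, 7, 1), page)); // violated: value 3 < min 4
  }
}
```

Filtering then builds on the same contract: a reader may safely skip any page whose [min, max] range cannot satisfy the predicate, which is exactly what the tests above exercise at different levels.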
Since then we have managed to run the tests on Spark as well and found an unrelated issue that leads to missing results on the Impala -> Spark data path when filtering on DECIMAL values (PARQUET-1472). While this issue is unrelated to the column index feature (and has been present in the code base since Parquet 1.10), we have decided to fix it and will shortly prepare a new release candidate that incorporates the fix.

Based on our testing efforts summarized above, we are confident in the correctness of the column index feature.

Br,
Zoltan
