Hi,

In the last Parquet sync we were asked to share details of the testing we had done to validate the correctness of column indexes. We were also asked to validate the column index structures of a randomly generated data set against their contracts. Please see a summary of our efforts below:
- TestColumnIndexBuilder is part of the code base. It is a low-level test for the ColumnIndexBuilder class, which is used on both the write path and the read path. The test uses in-memory data to assert the correctness of min and max values, null counts, and filtering based on them.
- TestColumnIndexFilter is part of the code base as well. It is a low-level test for the ColumnIndexFilter class, which is used on the read path only. The test uses in-memory data to assert the correctness of filtering based on indexes.
- TestColumnIndexFiltering is the third major test in the code base. It is a higher-level test that writes and reads actual files and asserts correct filtering end-to-end (from writing to reading).
- We also acted on the new request to validate the contract end-to-end. This validation is not part of the code base, as it was not requested when the feature was merged, only during the release vote. Because we would not like to delay the release further by polishing this test and taking it through the review process, we will add it to the Parquet code base later (the same checks are already covered at a lower level in TestColumnIndexBuilder). We have already run these new tests, though, and validated that the indexes fulfill their contracts. The fully functional but unpolished test code can be checked here: https://github.com/zivanfi/parquet-mr/commit/0e74b4207daa0b53cf4bd0774d2632c388845cb9
- Finally, we ran our internal interoperability test suite. This suite drives Parquet through Hive, Impala and Spark: it writes data using each of these engines and reads it back in every combination, covering various use cases. In the Parquet sync I reported that we were still working on getting Spark to work with our test suite on the Parquet release candidate.
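For illustration, the kind of per-page contract check described above (every non-null value lies within the recorded min/max bounds, and the null count is exact) can be sketched roughly as follows. Note that PageStats and fulfillsContract are simplified stand-ins invented for this sketch; they are not the actual parquet-mr ColumnIndexBuilder API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/**
 * Simplified sketch of a column index contract check. PageStats is a
 * hypothetical stand-in for one per-page entry of a column index
 * (min, max, null count); the real structures live in parquet-mr.
 */
public class ColumnIndexContractSketch {

  /** Hypothetical per-page statistics as stored in a column index. */
  static final class PageStats {
    final Integer min;    // null would indicate an all-null page
    final Integer max;
    final long nullCount;
    PageStats(Integer min, Integer max, long nullCount) {
      this.min = min;
      this.max = max;
      this.nullCount = nullCount;
    }
  }

  /** Checks that the stats of one page fulfill the contract w.r.t. its values. */
  static boolean fulfillsContract(PageStats stats, List<Integer> pageValues) {
    // The recorded null count must match the actual number of nulls exactly.
    long nulls = pageValues.stream().filter(Objects::isNull).count();
    if (nulls != stats.nullCount) {
      return false;
    }
    for (Integer v : pageValues) {
      if (v == null) {
        continue;
      }
      // Every non-null value must lie within [min, max].
      if (stats.min == null || stats.max == null || v < stats.min || v > stats.max) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    List<Integer> page = Arrays.asList(3, null, 7, 5);
    System.out.println(fulfillsContract(new PageStats(3, 7, 1), page)); // bounds and null count hold
    System.out.println(fulfillsContract(new PageStats(4, 7, 1), page)); // violated: value 3 < min 4
  }
}
```

Filtering then builds on the same contract: a reader may safely skip any page whose [min, max] range cannot satisfy the predicate, which is exactly what the tests above exercise at different levels.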
Since then we have managed to run the tests on Spark as well and found an unrelated issue that leads to missing results on the Impala -> Spark data path when filtering on DECIMAL values (PARQUET-1472). While this issue is unrelated to the column index feature (and has been present in the code base since Parquet 1.10), we have decided to fix it and will shortly prepare a new release candidate that incorporates the fix.

Based on our testing efforts summarized above, we are confident in the correctness of the column index feature.

Br,
Zoltan
