Hi dev@,

As part of acquiring the final sign-offs for the Parquet release, we had
another discussion with Ryan about the testing of column indexes.
He asked us to break our testing down into "fun size" test bites and send
them out incrementally to the community, so that everyone has time to
digest the tests and focus on them case by case.

We also agreed to focus on the write path, since that is the critical
part that we have to support for the long haul.

I will be sending out the related tests. Please feel free to ask clarifying
questions, and if there is a particular test case you want to see next,
please let me know.

As a first "taste", Zoltan created the following tool to *validate the
contract* for the indexes with randomly generated data:

- The min value stored in the index must be less than or equal to all
  values in the page.
- The max value stored in the index must be greater than or equal to all
  values in the page.
- The null count stored in the index must be equal to the number of
  nulls in the page.
- Only pages consisting entirely of NULLs can be marked as a null page
  in the index.
- According to the ASCENDING boundary order, the min value for a page
  must be greater than or equal to the min value of the previous page.
- According to the ASCENDING boundary order, the max value for a page
  must be greater than or equal to the max value of the previous page.
- According to the DESCENDING boundary order, the min value for a page
  must be less than or equal to the min value of the previous page.
- According to the DESCENDING boundary order, the max value for a page
  must be less than or equal to the max value of the previous page.

https://github.com/zivanfi/parquet-mr/commit/0e74b4207daa0b53cf4bd0774d2632c388845cb9
The code itself is unpolished, but it is fully functional and complete.
Once the release is signed off, we plan to refactor it and offer it as
part of parquet-tools or parquet-cli.
However, it is perfectly fine for validation as-is.
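
For readers who want to see the contract in executable form, here is a
minimal Python sketch of the checks listed above. It is not Zoltan's tool
and does not use the parquet-mr API: the function names, the plain-list
page representation, and the error messages are all hypothetical, and it
assumes that null pages are left out of the boundary order check.

```python
from typing import Any, List, Optional, Sequence


def check_page_stats(values: Sequence[Optional[Any]], min_value: Any,
                     max_value: Any, null_count: int,
                     is_null_page: bool) -> List[str]:
    """Return the contract violations for one page (empty list if none).

    Hypothetical helper: `values` holds the page's values, None meaning NULL.
    """
    errors = []
    non_null = [v for v in values if v is not None]
    if null_count != len(values) - len(non_null):
        errors.append("null count mismatch")
    if is_null_page:
        # Only pages consisting entirely of NULLs may be marked as null pages.
        if non_null:
            errors.append("null page contains non-null values")
        return errors
    if any(v < min_value for v in non_null):
        errors.append("min value is greater than a page value")
    if any(v > max_value for v in non_null):
        errors.append("max value is less than a page value")
    return errors


def check_boundary_order(mins: Sequence[Any], maxs: Sequence[Any],
                         order: str) -> List[str]:
    """Check ASCENDING or DESCENDING order across consecutive pages.

    Assumption: `mins` and `maxs` list only the non-null pages, in page order.
    Any other `order` value is treated as unordered and yields no errors.
    """
    errors = []
    for i in range(1, len(mins)):
        if order == "ASCENDING":
            min_ok = mins[i] >= mins[i - 1]
            max_ok = maxs[i] >= maxs[i - 1]
        elif order == "DESCENDING":
            min_ok = mins[i] <= mins[i - 1]
            max_ok = maxs[i] <= maxs[i - 1]
        else:
            break
        if not min_ok:
            errors.append(f"page {i}: min breaks {order} order")
        if not max_ok:
            errors.append(f"page {i}: max breaks {order} order")
    return errors
```

A random-data test in this style would generate pages, compute the index
values the way the writer does, and assert that both functions return an
empty list for every column.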

Best,
Anna
