Hi everyone,

I am currently working on the implementation of the Parquet page index for Impala. (The design doc is here if you are interested: https://docs.google.com/document/d/1D-el8njq_I-JKd3NDcW1mRXID_n0dBDKIkjWxwULVus/edit?usp=sharing)
During our discussions, it came up that DataPageHeaderV2 states that page boundaries are also record boundaries: https://github.com/apache/parquet-format/blob/54e6133e887a6ea90501ddd72fff5312b7038a7c/src/main/thrift/parquet.thrift#L532

DataPageHeader (V1) doesn't have this statement, which means that, in theory, it allows records to span multiple pages. Is that really the case, or is it something that is missing from the specification?

I ask because filtering pages based on the page index is much simpler if page boundaries are record boundaries as well.

Thanks,
Zoltan
