pitrou commented on code in PR #14351:
URL: https://github.com/apache/arrow/pull/14351#discussion_r1046076725
##########
cpp/src/parquet/column_reader.cc:
##########
@@ -400,6 +401,19 @@ std::shared_ptr<Page> SerializedPageReader::NextPage() {
ParquetException::EofException(ss.str());
}
+ const PageType::type page_type = LoadEnumSafe(¤t_page_header_.type);
+
+ if (properties_.use_page_checksum_verification() &&
+ page_type == PageType::DATA_PAGE && current_page_header_.__isset.crc) {
Review Comment:
> * DATA_PAGE_V2's checksum need to considering rep-level and def-levels,
which may doesn't share the same processing logic
That doesn't seem to be the case to me. The wording in the spec is just
confusingly complicated, while the reality is that the CRC32 is run on the data
in the form it's written to disk (therefore, including any optional compression
and encryption).
> parquet-mr doesn't implement checksum over DATA_PAGE_V1. For implement C++
code, it doesn't matter, but if we will lack the ability of cross-validation.
So maybe later I'd like to add in parquet-mr first.
I don't understand what that would change? Whether you add it first to
parquet-mr or to this implementation, you'll have the problem that the first
implementation to get it will not be able to cross-validate.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]