[GitHub] [arrow] wgtmac commented on a diff in pull request #14556: PARQUET-2211: [C++] Print ColumnMetaData.encoding_stats field

GitBox Sun, 06 Nov 2022 06:44:16 -0800


wgtmac commented on code in PR #14556:
URL: https://github.com/apache/arrow/pull/14556#discussion_r1014841622



##########
cpp/src/parquet/printer.cc:
##########
@@ -39,6 +39,25 @@ namespace parquet {
 
 class ColumnReader;
 
+namespace {
+
+void PrintPageEncodingStats(std::ostream& stream,
+                            const std::vector<PageEncodingStats>& encoding_stats) {
+  for (size_t i = 0; i < encoding_stats.size(); ++i) {
+    const auto& encoding = encoding_stats.at(i);
+    stream << EncodingToString(encoding.encoding);
+    if (encoding.page_type == parquet::PageType::DICTIONARY_PAGE) {
+      // Explicitly tell if this encoding comes from a dictionary page
+      stream << "(DICT_PAGE)";

Review Comment:
   The main idea is to indicate that this encoding comes from a dictionary page. 
IIUC, in Parquet 1.0 both the dictionary page and the data pages use 
PLAIN_DICTIONARY when dictionary encoding is applied, whereas in Parquet 2.0 the 
dictionary page uses PLAIN and the data pages use RLE_DICTIONARY. Without a 
marker, it is therefore difficult to tell whether a PLAIN_DICTIONARY or PLAIN 
entry comes from a dictionary page or a data page. Please see this for details: 
https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
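
To make the ambiguity concrete, here is a minimal, self-contained sketch. The enums, the `PageEncodingStats` struct, and the helpers `EncodingName` and `FormatEncodingStats` are simplified stand-ins written for illustration only; they are not the actual parquet-cpp declarations, and the space separator between entries is an assumption rather than what the PR prints.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the Parquet types (NOT the real parquet-cpp API).
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY };
enum class PageType { DATA_PAGE, DICTIONARY_PAGE };

struct PageEncodingStats {
  PageType page_type;
  Encoding encoding;
};

// Illustrative helper: maps the stand-in enum to a readable name.
std::string EncodingName(Encoding e) {
  switch (e) {
    case Encoding::PLAIN: return "PLAIN";
    case Encoding::PLAIN_DICTIONARY: return "PLAIN_DICTIONARY";
    case Encoding::RLE_DICTIONARY: return "RLE_DICTIONARY";
  }
  return "UNKNOWN";
}

// Formats every entry and tags the ones that come from a dictionary page.
// The trailing-space separator is an assumption made for this sketch.
std::string FormatEncodingStats(const std::vector<PageEncodingStats>& stats) {
  std::ostringstream out;
  for (const auto& s : stats) {
    out << EncodingName(s.encoding);
    if (s.page_type == PageType::DICTIONARY_PAGE) {
      out << "(DICT_PAGE)";
    }
    out << " ";
  }
  return out.str();
}
```

With a Parquet 1.0 style column chunk (a dictionary page and a data page that both report PLAIN_DICTIONARY), the two entries would print identically without the marker; with it, only the first carries `(DICT_PAGE)`. The same applies to a 2.0 style chunk, where a PLAIN entry could otherwise be mistaken for a plain-encoded data page.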



