Lars Volker has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer ......................................................................
Patch Set 2: (9 comments) http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 206: // levels data and the encoded values. If compression is enabled, this is the > Yeah I think it would be good to stamp out. I'm not sure how likely we are I used std::enable_if to ensure compile errors for unsupported types instead of linker errors. http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 355: // Tracks statistics per column. This gets reset when starting a new file. > I feel like the page/column terminology is confusing here. If page_stats_ i It is per row group, which coincides with files in our implementation. I renamed it accordingly. Line 974: for (int j = 0; j < columns_.size(); ++j) columns_[j]->CopyStatsMemory(); > Why j? Done http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: Line 104: /// them. > Is this per value? I.e. you can have min + max add up to 8kb. This is the combined size, I updated the comment to make it more clear. http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative to UTC, > I agree there's no logical timestamp type, but the physical type is still a I just noticed that I commented on this on the top-level discussion. Apologies for the confusion. PS2, Line 88: initialized_ > values_initialized_? I renamed it to has_values_ which is a bit shorter. Let me know if you feel strongly about values_initialized_ Line 165: using TypedColumnStatsBase<T>::initialized_; > I don't understand why these "using"s are necessary. They are necessary to access these members of the templated base class. An alternative would be using this-> to access them, but I reasoned that using 'using' makes the rest of the core easier to read. I added a comment. PS2, Line 173: internal > Remove "internal"? I don't think it's necessary. Done PS2, Line 266: && value->len == buffer->len() > I think the len check is redundant, right? It shouldn't point to the buffer Yes, good catch. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
