Marcel Kornacker has posted comments on this change. Change subject: IMPALA-4817: Populate Parquet Statistics for Strings ......................................................................
Patch Set 2: (15 comments) http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 541: const string* thrift_stats = nullptr; bad variable name: this is a plain-encoded value. thrift_stats sounds like it's a struct (it's definitely not something that requires a plural). Line 546: thrift_stats = ParquetMetadataUtils::GetThriftStats( why did you break this up instead of having ReadFromThrift do the extra work? do you need GetThriftStats anywhere else? the old control flow was easier to follow. Line 556: if (!thrift_stats) continue; explicit comparison http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: /// This class, together with its derivatives, is used to track column statistics when track is really not a meaningful term here, it generally just means 'follow'. revise description. Line 36: /// We currently support tracking 'min_value' and 'max_value' values for statistics. The hopefully also min and max, no? tracking means reading/decoding. Line 46: /// We currently don't write statistics for DECIMAL values and TIMESTAMP values due to why is that still not the case? Line 66: const string& thrift_stats, const ColumnType& col_type, void* slot); you changed the type and meaning of a parameter, but you didn't change the name. Line 72: /// Creates a copy of the contents of this object. Some data types (e.g. StringValue) unclear what this does, because copies are usually returned or passed into something. Line 146: /// This class contains further type-specific behavior that is common only to a subset of why can't this be collapsed into a 2-level hierarchy? Line 151: protected: protected follows public section http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 34: inline int64_t TypedColumnStatsBase<T>::BytesNeeded() const { hard to compare, please move the functions back to where they were. feel free to reorder at the end of the review cycle. http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: Line 243: int col_idx, const StatsField& stats_field, const ColumnType& col_type) { col_idx is unused. instead of passing in both the columnchunk and columntype, why not use col_chunk.meta_data.type? http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h File be/src/exec/parquet-metadata-utils.h: Line 60: static bool ReadOldStats(const ColumnType& col_type); this sounds like it's doing some reading. also, 'deprecated' instead of old. http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift File common/thrift/parquet.thrift: Line 344: BROTLI = 4; do we return an error when we see that codec? Line 567: /** Union containing the order used for min, max, and sorting values in a column why isn't this implied by the logical type of that column? if this is simply to mark "legacy" ordered columns, then why not simply have a bool here? -- To view, visit http://gerrit.cloudera.org:8080/6563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-HasComments: Yes
