Lars Volker has posted comments on this change. Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet Statistics for remaining types ......................................................................
Patch Set 4: (19 comments) Thank you for your comments. Please see my replies and PS4. http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG Commit Message: Line 25: test using that file. > We'll probably need to check in some files written with the old stats, righ I created a file using Hive and added tests to read from it. http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 541: bool stats_read = false; > bad variable name: this is a plain-encoded value. thrift_stats sounds like Done Line 546: if (fn_name == "lt" || fn_name == "le") { > why did you break this up instead of having ReadFromThrift do the extra wor Reworked it so it resembles the old flow. The additional check whether to read deprecated stats is now inside ReadFromThrift. Line 556: } > explicit comparison Done Line 559: } > Are you going to fix that in the CR? We don't handle unsigned columns in parquet files at all, so I'm currently leaning towards not handling unsigned stats, too. It looks like PR46 will not contain the ColumnOrder field after all. We should revisit this when we add support for unsigned logical types to the Parquet scanners. http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 31: /// This class, together with its derivatives, is used to compute column statistics when > track is really not a meaningful term here, it generally just means 'follow Done Line 36: /// We currently support writing the 'min_value' and 'max_value' fields in > hopefully also min and max, no? tracking means reading/decoding. Done Line 44: /// - Numeric values (BOOLEAN, INT, FLOAT, DOUBLE, DECIMAL) are ordered by their numeric > Add a comment to describe ordering for strings, to be consistent with speci Done Line 46: /// > why is that still not the case? Done Line 66: virtual ~ColumnStatsBase() {} > you changed the type and meaning of a parameter, but you didn't change the Done Line 72: static bool ReadFromThrift(const parquet::ColumnChunk& col_chunk, > unclear what this does, because copies are usually returned or passed into Done Line 146: virtual int64_t BytesNeeded() const override; > why can't this be collapsed into a 2-level hierarchy? Done Line 150: /// Encodes a single value using parquet's PLAIN encoding and stores it into the > I'm confused why we need a three-level hierarchy with additional template s Done Line 151: /// binary string 'out'. > protected follows public section Done http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 34: inline void ColumnStats<T>::Update(const T& v) { > hard to compare, please move the functions back to where they were. feel fr Done Line 103: inline void ColumnStats<bool>::EncodeValueToString( > Fixed? Yes, this seems to be the way to go. http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc File be/src/exec/parquet-metadata-utils.cc: Line 243: string version_string = tokens[2]; > col_idx is unused. We want to interpret the statistics based on the type of the column from our metadata, which cannot be inferred from the parquet column type alone. It seems safer to me to rely on our internal types here. Checking that the types found in the parquet file are compatible is done in parquet-metadata-utils. http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h File be/src/exec/parquet-metadata-utils.h: Line 60: /// (<major>.<minor>.<patch>). Unspecified parts default to 0, and extra parts are > this sounds like it's doing some reading. Done http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift File common/thrift/parquet.thrift: Line 344: BROTLI = 4; > do we return an error when we see that codec? Yes, we would catch that in ParquetMetadataUtils::ValidateColumn() where we validate the CompressionCodec against a list of supported codecs. -- To view, visit http://gerrit.cloudera.org:8080/6563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
