Lars Volker has posted comments on this change.

Change subject: IMPALA-4815, IMPALA-4817, IMPALA-4819: Populate Parquet 
Statistics for remaining types
......................................................................


Patch Set 4:

(19 comments)

Thank you for your comments. Please see my replies and PS4.

http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG
Commit Message:

Line 25: test using that file.
> We'll probably need to check in some files written with the old stats, righ
I created a file using Hive and added tests to read from it.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 541:     bool stats_read = false;
> bad variable name: this is a plain-encoded value. thrift_stats sounds like 
Done


Line 546:     if (fn_name == "lt" || fn_name == "le") {
> why did you break this up instead of having ReadFromThrift do the extra wor
Reworked it so it resembles the old flow. The additional check whether to read 
deprecated stats is now inside ReadFromThrift.


Line 556:     }
> explicit comparison
Done


Line 559:   }
> Are you going to fix that in the CR?
We don't handle unsigned columns in parquet files at all, so I'm currently 
leaning towards not handling unsigned stats, too. It looks like PR46 will not 
contain the ColumnOrder field after all. We should revisit this when we add 
support for unsigned logical types to the Parquet scanners.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// This class, together with its derivatives, is used to compute 
column statistics when
> track is really not a meaningful term here, it generally just means 'follow
Done


Line 36: /// We currently support writing the 'min_value' and 'max_value' 
fields in
> hopefully also min and max, no? tracking means reading/decoding.
Done


Line 44: /// - Numeric values (BOOLEAN, INT, FLOAT, DOUBLE, DECIMAL) are 
ordered by their numeric
> Add a comment to describe ordering for strings, to be consistent with speci
Done


Line 46: ///
> why is that still not the case?
Done


Line 66:   virtual ~ColumnStatsBase() {}
> you changed the type and meaning of a parameter, but you didn't change the 
Done


Line 72:   static bool ReadFromThrift(const parquet::ColumnChunk& col_chunk,
> unclear what this does, because copies are usually returned or passed into 
Done


Line 146:   virtual int64_t BytesNeeded() const override;
> why can't this be collapsed into a 2-level hierarchy?
Done


Line 150:   /// Encodes a single value using parquet's PLAIN encoding and 
stores it into the
> I'm confused why we need a three-level hierarchy with additional template s
Done


Line 151:   /// binary string 'out'.
> protected follows public section
Done


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 34: inline void ColumnStats<T>::Update(const T& v) {
> hard to compare, please move the functions back to where they were. feel fr
Done


Line 103: inline void ColumnStats<bool>::EncodeValueToString(
> Fixed?
Yes, this seems to be the way to go.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 243:     string version_string = tokens[2];
> col_idx is unused.
We want to interpret the statistics based on the type of the column from our 
metadata, which cannot be inferred from the parquet column type alone. It seems 
safer to me to rely on our internal types here. Checking that the types found 
in the parquet file are compatible is done in parquet-metadata-utils.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 60:   /// (<major>.<minor>.<patch>). Unspecified parts default to 0, and 
extra parts are
> this sounds like it's doing some reading.
Done


http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 344:   BROTLI = 4;
> do we return an error when we see that codec?
Yes, we would catch that in ParquetMetadataUtils::ValidateColumn() where we 
validate the CompressionCodec against a list of supported codecs.


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to