Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
......................................................................


Patch Set 2:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 541:     const string* thrift_stats = nullptr;
bad variable name: this is a plain-encoded value. thrift_stats sounds like it's 
a struct (it's definitely not something that requires a plural).


Line 546:       thrift_stats = ParquetMetadataUtils::GetThriftStats(
why did you break this up instead of having ReadFromThrift do the extra work? 
do you need GetThriftStats anywhere else? the old control flow was easier to 
follow.


Line 556:     if (!thrift_stats) continue;
explicit comparison


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 31: /// This class, together with its derivatives, is used to track column 
statistics when
track is really not a meaningful term here, it generally just means 'follow'. 
revise description.


Line 36: /// We currently support tracking 'min_value' and 'max_value' values 
for statistics. The
hopefully also min and max, no? tracking means reading/decoding.


Line 46: /// We currently don't write statistics for DECIMAL values and 
TIMESTAMP values due to
why is that still not the case?


Line 66:       const string& thrift_stats, const ColumnType& col_type, void* 
slot);
you changed the type and meaning of a parameter, but you didn't change the name.


Line 72:   /// Creates a copy of the contents of this object. Some data types 
(e.g. StringValue)
unclear what this does, because copies are usually returned or passed into 
something.


Line 146: /// This class contains further type-specific behavior that is common 
only to a subset of
why can't this be collapsed into a 2-level hierarchy?


Line 151:  protected:
protected follows public section


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 34: inline int64_t TypedColumnStatsBase<T>::BytesNeeded() const {
hard to compare, please move the functions back to where they were. feel free 
to reorder at the end of the review cycle.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

Line 243:     int col_idx, const StatsField& stats_field, const ColumnType& 
col_type) {
col_idx is unused.

instead of passing in both the columnchunk and columntype, why not use 
col_chunk.meta_data.type?


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-metadata-utils.h
File be/src/exec/parquet-metadata-utils.h:

Line 60:   static bool ReadOldStats(const ColumnType& col_type);
this sounds like it's doing some reading.

also, 'deprecated' instead of old.


http://gerrit.cloudera.org:8080/#/c/6563/2/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 344:   BROTLI = 4;
do we return an error when we see that codec?


Line 567: /** Union containing the order used for min, max, and sorting values 
in a column
why isn't this implied by the logical type of that column?

if this is simply to mark "legacy" ordered columns, then why not simply have a 
bool here?


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-HasComments: Yes

Reply via email to