Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 9:

(17 comments)

Thanks for the review. Please see my inline comments and the new PS.

http://gerrit.cloudera.org:8080/#/c/5611/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS6, Line 178: ProcessValue
> that was just an off-the-cuff suggestion. if 'append' is more specific, the
I tend to favor ProcessValue, since Append (like Encode) leaves out the 
statistics part. While ProcessValue is vague, it implies that the values is 
"consumed" afterwards, and that it may be necessary to look at the comment for 
further details. That being said, how about "ConsumeValue"? Does that imply 
ownership transfer too much?


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 134:   void EncodeColumnStats(ColumnMetaData* meta_data) {
> find a better name. 'column stats' is not a thrift concept. these are speci
Done


Line 236:   // Created and set by the derived class.
> owner? same for the other pointer members.
Done


Line 339:   int64_t encoded_value_size_;
> this seems to be the plain encoding size. even for dict-encoded cols?
Yes, the dictionary encoders use it as the size of the encoded values. Should 
we rename it to plain_encoded_value_size_?


Line 347:   // Tracks statistics per row group. This gets reset when starting a 
new file.
> hopefully when starting a new row group
Yes, Done.


Line 643:   DCHECK(page_stats_base_ != nullptr);
> how does this handle unsupported types?
Currently the ColumnStats<T> gets created for all supported types, and the 
ColumnStats c'tor ensures that the class can only be instantiated for supported 
types. However, for some types we don't write out the statistics to the 
resulting parquet file by disabling the Update() logic, thus ensuring that the 
stats objects stay empty. This allows us to keep the interface of ColumnStats 
simpler.


Line 1028:     
columns_[i]->EncodeColumnStats(&current_row_group_->columns[i].meta_data);
> where do the row group stats get reset?
Line 281 of this PS.


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103:   /// Maximum statistics size. If the combined size of the min and 
max values of
> does this refer to a single thrift Statistics struct? if so, spell that out
Done


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 65:   void EncodeToThrift(T* parent) const {
> this feels more convoluted than it needs to be. i think it would be better 
Done


Line 88:   // We explicitly require types to be listed here in order to support 
column statistics.
> i don't understand, i thought those listed types are specifically not suppo
We don't update statistics for these types, but creating ColumnStats for them 
is still supported. This is because we want to support them soon. If we 
disallow creating them, then the caller will have to deal with only creating 
ColumnStats for supported column types.


Line 90:   // follow the ordering semantics of parquet's min/max statistics for 
the new type.
> what are the ordering semantics? (that order as byte sequence == value orde
The values should be ordered according to what parquet-mr says (which is 
hopefully in line with what Hive and Spark do). However, there is currently a 
lack of specification, especially for logical types, that we're working on 
clarifying with the parquet-format project. I added a comment explaining where 
to look for the ordering semantics.


Line 97:       T>::type;
> i find the formatting hard to decipher. please reformat by hand (for instan
Done


Line 127:       // statistics behavior from any implicit behavior of the types?
> but shouldn't the stats reflect the behavior of the underlying types. ie, w
The stats should follow the ordering semantics defined in parquet-format. Those 
are currently underspecified, we're working on getting them fixed so we can 
write stats for other types, too (see my other comments in this review). While 
I agree that they should ideally be the same, I don't think we can guarantee or 
enforce that, since ultimately the parquet-format project serves as the 
reference.


Line 148:   /// Encodes a single value into an output string using parquet's 
plain encoding.
> 'an output string' makes it sound like this gets converted into a string ty
Done


Line 159:     return encoded_value_size_ < 0 ? 
ParquetPlainEncoder::ByteSize<T>(v) :
> reformat by hand
Done


http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 89:   static int ByteSize(const T& v) { return sizeof(T); }
> does this function make sense at all? why not simply call sizeof()?
It is specialized for several types below. Should I add a comment here?


http://gerrit.cloudera.org:8080/#/c/5611/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

Line 90:   """Decode parquet statistics values that are encoded with PLAIN 
encoding."""
> "that are encoded": do you mean "expects 'value' to be plain encoded"?
I changed the comment. The notable difference between statistics and stats is 
that this method does only support single-bit boolean values. Additionally, 
binary arrays in statistics don't follow plain encoding semantics but are 
stored directly in the thrift objects. I extended the comment to make it more 
clear.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Michael Brown <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to