Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................

Patch Set 9:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 134: void EncodeColumnStats(ColumnMetaData* meta_data) {
find a better name. 'column stats' is not a thrift concept. these are specifically row group stats.

Line 236: // Created and set by the derived class.
owner? same for the other pointer members.

Line 339: int64_t encoded_value_size_;
this seems to be the plain encoding size. even for dict-encoded cols?

Line 347: // Tracks statistics per row group. This gets reset when starting a new file.
hopefully when starting a new row group

Line 643: DCHECK(page_stats_base_ != nullptr);
how does this handle unsupported types?

Line 1028: columns_[i]->EncodeColumnStats(&current_row_group_->columns[i].meta_data);
where do the row group stats get reset?

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 103: /// Maximum statistics size. If the combined size of the min and max values of
does this refer to a single thrift Statistics struct? if so, spell that out.

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 65: void EncodeToThrift(T* parent) const {
this feels more convoluted than it needs to be. i think it would be better for this class only to deal with thrift::Statistics and let the caller make the appropriate __set_xxx call (which means you won't need a templatized function).

Line 88: // We explicitly require types to be listed here in order to support column statistics.
i don't understand, i thought those listed types are specifically not supported. what exactly does this do?

Line 90: // follow the ordering semantics of parquet's min/max statistics for the new type.
what are the ordering semantics? (that order as byte sequence == value order?)

Line 97: T>::type;
i find the formatting hard to decipher. please reformat by hand (for instance, by moving the first is_arithmetic to a new line, which would make the argument grouping clearer).

Line 127: // statistics behavior from any implicit behavior of the types?
but shouldn't the stats reflect the behavior of the underlying types? ie, why should the stats '<' be any different than the '<' of the underlying type?

Line 148: /// Encodes a single value into an output string using parquet's plain encoding.
'an output string' makes it sound like this gets converted into a string type, ie, byte_array in parquet parlance. but plain encoding requires int32, int64, etc., parquet types. you're encoding as 'plain', stored in a binary string. best to make that clear in the comment. (also, what does 'output' mean here?)

Line 159: return encoded_value_size_ < 0 ? ParquetPlainEncoder::ByteSize<T>(v) :
reformat by hand

http://gerrit.cloudera.org:8080/#/c/5611/9/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 89: static int ByteSize(const T& v) { return sizeof(T); }
does this function make sense at all? why not simply call sizeof()?

http://gerrit.cloudera.org:8080/#/c/5611/9/tests/util/get_parquet_metadata.py
File tests/util/get_parquet_metadata.py:

Line 90: """Decode parquet statistics values that are encoded with PLAIN encoding."""
"that are encoded": do you mean "expects 'value' to be plain encoded"? also, why is this specific to stats (as opposed to any plain-encoded value)?

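Illustration for the ordering question on parquet-column-stats.h line 90 and the plain-encoding comments above: a minimal sketch, not code from the patch. It assumes only that parquet's PLAIN encoding stores an INT32 as 4 little-endian bytes; the helper names plain_encode_int32 and plain_decode_int32 are hypothetical and do not appear in Impala or in get_parquet_metadata.py. It shows that comparing the encoded byte strings does not give the same min/max as comparing the typed values, so for such types the stats have to be computed on values and only then encoded.

```python
import struct


def plain_encode_int32(value):
    """Encode a signed 32-bit integer as 4 little-endian bytes.

    Hypothetical helper; mirrors the PLAIN layout for parquet INT32.
    """
    return struct.pack("<i", value)


def plain_decode_int32(encoded):
    """Inverse of plain_encode_int32; expects 'encoded' to be exactly 4 bytes."""
    if len(encoded) != 4:
        raise ValueError("expected 4 bytes, got %d" % len(encoded))
    return struct.unpack("<i", encoded)[0]


values = [256, 1, -1]
encoded = [plain_encode_int32(v) for v in values]

# Min/max taken on the typed values (signed integer comparison).
value_min, value_max = min(values), max(values)

# Min/max taken on the encoded byte strings (lexicographic, unsigned bytes),
# decoded afterwards so the two results can be compared directly.
bytes_min = plain_decode_int32(min(encoded))
bytes_max = plain_decode_int32(max(encoded))

print("by value:", value_min, value_max)
print("by bytes:", bytes_min, bytes_max)
```

The decode helper takes any plain-encoded INT32, not just a statistics value, which is the generality the last comment asks about.
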
--
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker
Gerrit-Reviewer: Lars Volker
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Mostafa Mokhtar
Gerrit-Reviewer: Tim Armstrong
Gerrit-Reviewer: Zoltan Ivanfi
Gerrit-HasComments: Yes
