Tim Armstrong has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer ......................................................................
Patch Set 2: (12 comments) I need to read through the test code but I thought I'd flush out my comments about the implementation. I think I understand how it all works now, but it would be good to make it a bit easier to follow. http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 206: // levels data and the encoded values. If compression is enabled, this is the > I added a comment to the class header. To me it looks safe to rely on the o Yeah I think it would be good to stamp out. I'm not sure how likely we are to run into problems but it's hard to anticipate what new data types we might add. Line 598: > I made the *_stats_ inline members, so their default initialization is suff Ah ok, didn't realise that was the reason. I'll take a look at the new code. http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 355: // Tracks statistics per column. This gets reset when starting a new file. I feel like the page/column terminology is confusing here. If page_stats_ is per page, shouldn't this be per file? Both of these are stats for the column, right? Line 974: for (int j = 0; j < columns_.size(); ++j) columns_[j]->CopyStatsMemory(); Why j? for (BaseColumnWriter* column : columns_) would be more readable. http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.h File be/src/exec/hdfs-parquet-table-writer.h: Line 104: /// them. Is this per value? I.e. you can have min + max add up to 8kb. http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 33: /// Regarding the ordering of values, we follow the parquet-mr reference implementation. Thanks, this is very helpful. PS2, Line 35: DECIMAL We write decimals as a big-endian FIXED_LEN_BYTE_ARRAY type, so the lexicographic order will match the numeric order. We should call this out though since it's a bit subtle. We should also mention that we don't support other encodings: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal Now that I think about it, decimal encoded into variable-length binary might be an example where the logical ordering doesn't match the physical ordering, since 256 => "0x01 0x00" would be sorted before 1 => "0x01" (if I understood everything correctly). Line 39: /// TIMESTAMP values are written in the in-memory format used by Impala, relative to UTC, I think we convert them to INT96 for parquet. I think the INT96 order is equivalent to our timestamp order, but we should be clear on this. PS2, Line 88: initialized_ values_initialized_? We use initialized_ elsewhere in the backend to indicate whether an Init() method has been called, so I initially thought this was the same thing when reading the code. Line 165: using TypedColumnStatsBase<T>::initialized_; I don't understand why these "using"s are necessary. PS2, Line 173: internal Remove "internal"? I don't think it's necessary. PS2, Line 266: && value->len == buffer->len() I think the len check is redundant, right? It shouldn't point to the buffer unless it's the right length already I think. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Ivanfi <[email protected]> Gerrit-HasComments: Yes
