Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(12 comments)

I need to read through the test code but I thought I'd flush out my comments 
about the implementation. I think I understand how it all works now, but it 
would be good to make it a bit easier to follow.

http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 206:     // levels data and the encoded values.  If compression is 
enabled, this is the
> I added a comment to the class header. To me it looks safe to rely on the o
Yeah I think it would be good to stamp out. I'm not sure how likely we are to 
run into problems but it's hard to anticipate what new data types we might add.


Line 598: 
> I made the *_stats_ inline members, so their default initialization is suff
Ah ok, didn't realise that was the reason. I'll take a look at the new code.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 355:   // Tracks statistics per column. This gets reset when starting a 
new file.
I feel like the page/column terminology is confusing here. If page_stats_ is 
per page, shouldn't this be per file?

Both of these are stats for the column, right?


Line 974:   for (int j = 0; j < columns_.size(); ++j) 
columns_[j]->CopyStatsMemory();
Why j?

for (BaseColumnWriter* column : columns_) would be more readable.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 104:   /// them.
Is this per value? I.e. you can have min + max add up to 8kb.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 33: /// Regarding the ordering of values, we follow the parquet-mr 
reference implementation.
Thanks, this is very helpful.


PS2, Line 35: DECIMAL
We write decimals as a big-endian FIXED_LEN_BYTE_ARRAY type, so the 
lexicographic order will match the numeric order. We should call this out 
though since it's a bit subtle.

We should also mention that we don't support other encodings: 
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal

Now that I think about it, decimal encoded into variable-length binary might be 
an example where the logical ordering doesn't match the physical ordering, 
since 256 => "0x01 0x00" would be sorted before 1 => "0x01" (if I understood 
everything correctly).


Line 39: /// TIMESTAMP values are written in the in-memory format used by 
Impala, relative to UTC,
I think we convert them to INT96 for parquet. I think the INT96 order is 
equivalent to our timestamp order, but we should be clear on this.


PS2, Line 88: initialized_
values_initialized_? 

We use initialized_ elsewhere in the backend to indicate whether an Init() 
method has been called, so I initially thought this was the same thing when 
reading the code.


Line 165:   using TypedColumnStatsBase<T>::initialized_;
I don't understand why these "using"s are necessary.


PS2, Line 173: internal
Remove "internal"? I don't think it's necessary.


PS2, Line 266: && value->len == buffer->len()
I think the len check is redundant, right? It shouldn't point to the buffer 
unless it's the right length already I think.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to