Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 206:     // levels data and the encoded values.  If compression is 
enabled, this is the
> Yeah I think it would be good to stamp out. I'm not sure how likely we are 
I used std::enable_if to ensure compile errors for unsupported types instead of 
linker errors.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 355:   // Tracks statistics per column. This gets reset when starting a 
new file.
> I feel like the page/column terminology is confusing here. If page_stats_ i
It is per row group, which coincides with files in our implementation. I 
renamed it accordingly.


Line 974:   for (int j = 0; j < columns_.size(); ++j) 
columns_[j]->CopyStatsMemory();
> Why j?
Done


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

Line 104:   /// them.
> Is this per value? I.e. you can have min + max add up to 8kb.
This is the combined size, I updated the comment to make it more clear.


http://gerrit.cloudera.org:8080/#/c/5611/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 39: /// TIMESTAMP values are written in the in-memory format used by 
Impala, relative to UTC,
> I agree there's no logical timestamp type, but the physical type is still a
I just noticed that I commented on this on the top-level discussion. Apologies 
for the confusion.


PS2, Line 88: initialized_
> values_initialized_? 
I renamed it to has_values_ which is a bit shorter. Let me know if you feel 
strongly about values_initialized_


Line 165:   using TypedColumnStatsBase<T>::initialized_;
> I don't understand why these "using"s are necessary.
They are necessary to access these members of the templated base class. An 
alternative would be using this-> to access them, but I reasoned that using 
'using' makes the rest of the core easier to read. I added a comment.


PS2, Line 173: internal
> Remove "internal"? I don't think it's necessary.
Done


PS2, Line 266: && value->len == buffer->len()
> I think the len check is redundant, right? It shouldn't point to the buffer
Yes, good catch.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Ivanfi <[email protected]>
Gerrit-HasComments: Yes

Reply via email to