Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4817: Populate Parquet Statistics for Strings
......................................................................


Patch Set 2:

(7 comments)

I had a few high-level comments/questions.

http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG
Commit Message:

Line 25: TODO: This change still needs tests reading statistics from files which
We'll probably need to check in some files written with the old stats, right? 
Since we'll switch the version of Hive at some point. Might be best to just do 
this now so that we've got the test coverage.


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 559:     // TODO: Here we need to pass in the column order (signed vs 
unsigned).
Are you going to fix that in the CR?


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 44: /// value (as opposed to their binary representation).
Add a comment to describe ordering for strings, to be consistent with 
specifying these orders.


Line 150: class ColumnStats : public TypedColumnStatsBase<T> {
I'm confused why we need a three-level hierarchy with additional template 
specialisation. I don't understand what having the TypedColumnStatsBase level 
to the hierarchy buys us.

If we're going to use template specialisation it seems like we should be able 
to collapse ColumnStats and TypedColumnStatsBase into one somehow


http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 103:   // TODO: How to do memory management here?
Fixed?


http://gerrit.cloudera.org:8080/#/c/6563/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 437:     """Test that Impala does not write statistics for char columns."""
Comment out-of-date?


Line 445:     ]
Can we add some tests involving "interesting" characters. E.g. '\0' and some 
bytes >= 128. I think these should "just work" in our implementation but it 
would be good to have the coverage, since these things were broken with the old 
Hive stats.


-- 
To view, visit http://gerrit.cloudera.org:8080/6563
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to