Tim Armstrong has posted comments on this change. Change subject: IMPALA-4817: Populate Parquet Statistics for Strings ......................................................................
Patch Set 2: (7 comments) I had a few high-level comments/questions. http://gerrit.cloudera.org:8080/#/c/6563/2//COMMIT_MSG Commit Message: Line 25: TODO: This change still needs tests reading statistics from files which We'll probably need to check in some files written with the old stats, right? Since we'll switch the version of Hive at some point. Might be best to just do this now so that we've got the test coverage. http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 559: // TODO: Here we need to pass in the column order (signed vs unsigned). Are you going to fix that in the CR? http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 44: /// value (as opposed to their binary representation). Add a comment to describe ordering for strings, to be consistent with specifying these orders. Line 150: class ColumnStats : public TypedColumnStatsBase<T> { I'm confused why we need a three-level hierarchy with additional template specialisation. I don't understand what having the TypedColumnStatsBase level to the hierarchy buys us. If we're going to use template specialisation it seems like we should be able to collapse ColumnStats and TypedColumnStatsBase into one somehow http://gerrit.cloudera.org:8080/#/c/6563/2/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 103: // TODO: How to do memory management here? Fixed? http://gerrit.cloudera.org:8080/#/c/6563/1/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 437: """Test that Impala does not write statistics for char columns.""" Comment out-of-date? Line 445: ] Can we add some tests involving "interesting" characters. E.g. '\0' and some bytes >= 128. I think these should "just work" in our implementation but it would be good to have the coverage, since these things were broken with the old Hive stats. -- To view, visit http://gerrit.cloudera.org:8080/6563 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3ef4a5d25a57c82577fd498d6d1c4297ecf39312 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
