Lars Volker has posted comments on this change. Change subject: IMPALA-5061: Populate null_count in parquet::statistics ......................................................................
Patch Set 4: (13 comments) http://gerrit.cloudera.org:8080/#/c/7058/4//COMMIT_MSG Commit Message: Line 10: value is encountered by paquet table writer. The value is written nit typo http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 197: if(page_stats_base_ != nullptr) { nit: single line http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 79: static bool ReadCountFromThrift( > If the return type is int64, what would the return value be in case the sca For frequently called functions it's often better to return a bool instead of the more heavy-weight Status. In this case you would probably pass an int64_t* to return the value. Let's remove the method here altogether and re-introduce it in a change that handles reading the stats. http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: PS4, Line 63: NULL_COUNT remove PS4, Line 105: row column? PS4, Line 105: Initilizes typo PS4, Line 116: NULL nit: lowercase http://gerrit.cloudera.org:8080/#/c/7058/4/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 32: inline void ColumnStatsBase::UpdateNullCount() { It looks like this will fit into one line in the header. If so, please move it there. Line 186: if (cs->has_values_) { You could also change the interface of Update() to take a min and max parameter and then clean up this method by calling it. You don't have to do it in this change though if you think it's too much. Line 205: null_count_ += cs->null_count_; If you add a "int64_t count" parameter to the UpdateNullCount method, you can call it from here. http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 427: ColumnStats('bigint_col', 0, 90, 0), > Done Cool :) http://gerrit.cloudera.org:8080/#/c/7058/4/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 324: null_count = None How about moving this to L330, initialize it from stats.null_count and then check it's not None? Line 480: """Test that we don't write min/max statistics for null columns. Ensure null_count Please remove the trailing whitespace. You can also set-up git in a way that it warns about trailing whitespace during commit time. Even better, you may be able to configure your editor to highlight it for you. -- To view, visit http://gerrit.cloudera.org:8080/7058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Pooja Nilangekar <[email protected]> Gerrit-HasComments: Yes
