Lars Volker has posted comments on this change. Change subject: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer ......................................................................
Patch Set 2: (12 comments) Thank you for working on this! http://gerrit.cloudera.org:8080/#/c/7058/2//COMMIT_MSG Commit Message: Line 7: IMPALA-5061: Populate null_count in parquet::statistics in the parquet table writer Nit: Make the first line of the commit message fit in 80 chars. You can end it after "statistics" http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 553: // We need to get min_value. Let's only do the write path in this change and then address the read path in a subsequent change. http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: PS2, Line 198: {} Do you need the default implementation here? Without it, you could also remove the "Implemented in.." comment, since then it'd be clear. http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h File be/src/exec/parquet-column-stats.h: Line 79: static bool ReadCountFromThrift( I think it will be easier to return an int64 here. However we should do the read path in a later change altogether. Line 115: bool has_null_count_; Why do we need has_null_count_ instead of just always counting them? Are there types for which we don't know the number of nulls we wrote? Line 170: void UpdateNullCount(); Can this go into the base class? It doesn't depend on T. http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.inline.h File be/src/exec/parquet-column-stats.inline.h: Line 50: null_count_++; nit: we use prefix operators, mostly for consistency, but sometimes for performance reasons. http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 323: if ((stats.min_value is None and stats.max_value is None) With the additional checks below, is this line still needed? What is its intent? Please remove the trailing whitespace Line 328: min_value = max_value = null_count = None You should have one assignment per line Line 427: ColumnStats('id', 0, 7299, None), Why are these None and not 0? I think we should count 0s even if there are none. PS2, Line 607: Replace tabs with spaces, here and elsewhere Line 620: "functional_parquet.zipcode_incomes", expected_min_max_values) The formatting here seems off -- To view, visit http://gerrit.cloudera.org:8080/7058 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-HasComments: Yes
