Lars Volker has posted comments on this change.

Change subject: IMPALA-5061: Populate null_count in parquet::statistics in the 
parquet table writer
......................................................................


Patch Set 2:

(12 comments)

Thank you for working on this!

http://gerrit.cloudera.org:8080/#/c/7058/2//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5061: Populate null_count in parquet::statistics in the parquet 
table writer
Nit: Make the first line of the commit message fit in 80 chars. You can end it 
after "statistics"


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 553:       // We need to get min_value.
Let's only do the write path in this change and then address the read path in a 
subsequent change.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS2, Line 198: {}
Do you need the default implementation here? Without it, you could also remove 
the "Implemented in.." comment, since then it'd be clear.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.h
File be/src/exec/parquet-column-stats.h:

Line 79:   static bool ReadCountFromThrift(
I think it will be easier to return an int64 here. However we should do the 
read path in a later change altogether.


Line 115:   bool has_null_count_;
Why do we need has_null_count_ instead of just always counting them? Are there 
types for which we don't know the number of nulls we wrote?


Line 170:   void UpdateNullCount();
Can this go into the base class? It doesn't depend on T.


http://gerrit.cloudera.org:8080/#/c/7058/2/be/src/exec/parquet-column-stats.inline.h
File be/src/exec/parquet-column-stats.inline.h:

Line 50:   null_count_++;
nit: we use prefix operators, mostly for consistency, but sometimes for 
performance reasons.


http://gerrit.cloudera.org:8080/#/c/7058/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 323:       if ((stats.min_value is None and stats.max_value is None) 
With the additional checks below, is this line still needed? What is its intent?

Please remove the trailing whitespace


Line 328:       min_value = max_value = null_count = None
You should have one assignment per line


Line 427:         ColumnStats('id', 0, 7299, None),
Why are these None and not 0? I think we should count 0s even if there are none.


PS2, Line 607:  
Replace tabs with spaces, here and elsewhere


Line 620:                                                                       
"functional_parquet.zipcode_incomes", expected_min_max_values)
The formatting here seems off


-- 
To view, visit http://gerrit.cloudera.org:8080/7058
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4c49a63af84c2234f0633be63206cb52eb7e8ebb
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-HasComments: Yes

Reply via email to