Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12974 )

Change subject: IMPALA-7608: Estimate row count from file size when no stats 
available
......................................................................


Patch Set 9:

(9 comments)

Thanks for the patience with this. I have a few more asks but this is looking 
pretty good.

http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@172
PS9, Line 172:   // of the file before compression.
Can you leave a comment explaining briefly how the estimates were produced. 
This is just useful in case someone needs to update the estimates.


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@310
PS9, Line 310:
nit: only need one blank line


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1230
PS9, Line 1230:   //Compute the estimated table size when taking compression 
into consideration
Please use a javadoc method comment, i.e. /** */


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1241
PS9, Line 1241:             estimatedPartitionSize = estimatedPartitionSize
nit:  to be more concise,

  estimatedPartitionSize += ...


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1244
PS9, Line 1244:           } else {// When the text file is not compressed.
Nit: comment placement here and on l1250 is a little non-standard. It's ok, but 
maybe just move it to the next line for consistency?


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1258
PS9, Line 1258:           throw new RuntimeException("Unknown Hdfs compressed 
format: "
Could write this more densely as:


  if (VALID_LEGACY_FORMATS.contains(format)) {
          estimatedPartitionSize = estimatedPartitionSize
              + Math.round(p.getSize() * ESTIMATED_COMPRESSION_FACTOR_LEGACY);
        } else {
         Preconditions.checkState((VALID_COLUMNAR_FORMATS.contains(format), 
"Unknown HDFS compressed format: %s", this));
  }


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1262
PS9, Line 1262:  estimatedTableSize = estimatedTableSize +
nit:  to be more concise,

  estimatedTableSize += estimatedPartitionSize


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java
File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java:

http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@43
PS9, Line 43: tolerance
Constant should be upper case. It's also not 100% obvious this is the 
cardinality tolerance. So maybe CARDINALITY_TOLERANCE


http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@723
PS9, Line 723:         expected, planRoot.getCardinality(), expected * 
tolerance);
It looks like we made existing tests looser. I think we should keep those tests 
strict, i.e. enforce that cardinality exactly matches, and use approximate 
checks for the new tests where estimates are based on files sizes. It would be 
good if the method names reflected the behaviour, e.g. 
verifyApproxCardinality().

Otherwise someone reading the tests would assume that the checks are exact.



--
To view, visit http://gerrit.cloudera.org:8080/12974
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a
Gerrit-Change-Number: 12974
Gerrit-PatchSet: 9
Gerrit-Owner: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Wed, 05 Jun 2019 22:45:50 +0000
Gerrit-HasComments: Yes

Reply via email to