Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12974 )
Change subject: IMPALA-7608: Estimate row count from file size when no stats available ...................................................................... Patch Set 9: (9 comments) Thanks for the patience with this. I have a few more asks but this is looking pretty good. http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@172 PS9, Line 172: // of the file before compression. Can you leave a comment explaining briefly how the estimates were produced. This is just useful in case someone needs to update the estimates. http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@310 PS9, Line 310: nit: only need one blank line http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1230 PS9, Line 1230: //Compute the estimated table size when taking compression into consideration Please use a javadoc method comment, i.e. /** */ http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1241 PS9, Line 1241: estimatedPartitionSize = estimatedPartitionSize nit: to be more concise, estimatedPartitionSize += ... http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1244 PS9, Line 1244: } else {// When the text file is not compressed. Nit: comment placement here and on l1250 is a little non-standard. It's ok, but maybe just move it to the next line for consistency? http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1258 PS9, Line 1258: throw new RuntimeException("Unknown Hdfs compressed format: " Could write this more densely as: if (VALID_LEGACY_FORMATS.contains(format)) { estimatedPartitionSize = estimatedPartitionSize + Math.round(p.getSize() * ESTIMATED_COMPRESSION_FACTOR_LEGACY); } else { Preconditions.checkState((VALID_COLUMNAR_FORMATS.contains(format), "Unknown HDFS compressed format: %s", this)); } http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1262 PS9, Line 1262: estimatedTableSize = estimatedTableSize + nit: to be more concise, estimatedTableSize += estimatedPartitionSize http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java File fe/src/test/java/org/apache/impala/planner/CardinalityTest.java: http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@43 PS9, Line 43: tolerance Constant should be upper case. It's also not 100% obvious this is the cardinality tolerance. So maybe CARDINALITY_TOLERANCE http://gerrit.cloudera.org:8080/#/c/12974/9/fe/src/test/java/org/apache/impala/planner/CardinalityTest.java@723 PS9, Line 723: expected, planRoot.getCardinality(), expected * tolerance); It looks like we made existing tests looser. I think we should keep those tests strict, i.e. enforce that cardinality exactly matches, and use approximate checks for the new tests where estimates are based on files sizes. It would be good if the method names reflected the behaviour, e.g. verifyApproxCardinality(). Otherwise someone reading the tests would assume that the checks are exact. -- To view, visit http://gerrit.cloudera.org:8080/12974 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic414121c8df0d5222e4aeea096b5365beb04568a Gerrit-Change-Number: 12974 Gerrit-PatchSet: 9 Gerrit-Owner: Fang-Yu Rao <fangyu....@cloudera.com> Gerrit-Reviewer: Fang-Yu Rao <fangyu....@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Paul Rogers <prog...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Wed, 05 Jun 2019 22:45:50 +0000 Gerrit-HasComments: Yes