Aman Sinha has posted comments on this change. ( http://gerrit.cloudera.org:8080/16842 )
Change subject: IMPALA-10377 Improve the accuracy of resource estimation PlanNode does not consider some factors when estimating memory, this will cause a large error rate ...................................................................... Patch Set 10: (5 comments) Thanks for removing the extraneous test files. I am sending some initial comments based on a quick look. Will send more later. http://gerrit.cloudera.org:8080/#/c/16842/10//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16842/10//COMMIT_MSG@7 PS10, Line 7: IMPALA-10377 Improve the accuracy of resource estimation Few general suggestions on the commit message: - The 'IMPALA-xxxx' should be followed by a colon ':'. - The lines are wider than the 90 chars limit that checkstyle enforces..I am not sure if this passed the checkstyle but would be useful to re-format to conform to the per line limit. - Add a short summary of the Testing changes http://gerrit.cloudera.org:8080/#/c/16842/10//COMMIT_MSG@25 PS10, Line 25: 1.Estimate memory by scanning all columns,the estimated memory is much larger than the actual use. Is this describing the original issue or the nature of the fix itself ? I assume it is the original issue. .. in that case can you rephrase to avoid confusion ? Something like 'Current estimate is based on all columns being read which over-estimates the memory. This patch fixes it by ...' Actually, the descriptions of the other nodes above could also benefit by clarifying the old and new behavior. http://gerrit.cloudera.org:8080/#/c/16842/10/fe/src/main/java/org/apache/impala/planner/AggregationNode.java File fe/src/main/java/org/apache/impala/planner/AggregationNode.java: http://gerrit.cloudera.org:8080/#/c/16842/10/fe/src/main/java/org/apache/impala/planner/AggregationNode.java@66 PS10, Line 66: private final static long MIN_AGG_MEM = 100L * 1024L; It seems if we are fixing this estimate for plain aggregates (i.e with no group-by), even 100KB seems kind of large. If we have SELECT count(*), min(a), sum(b) ... each of these is a scalar value of BIGINT type and consumes 8 bytes each, so if on average we have 10 plain aggregates in a query, it is only 80 bytes. Can we make this estimate a function of the number of plain aggregate functions in an AggregationNode ? Also, the name MIN_AGG_MEM may be generic ..since this is specifically for the no group-by case. One suggestion is MIN_PLAIN_AGG_MEM. Other suggestions ? http://gerrit.cloudera.org:8080/#/c/16842/10/fe/src/main/java/org/apache/impala/planner/AggregationNode.java@538 PS10, Line 538: // A skew factor of 1.5 was added to account for data skew among nit: you need not mention the 1.5 value here because it is already a static final constant. If someone changes the value in the code, this comment will become inconsistent. http://gerrit.cloudera.org:8080/#/c/16842/10/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java: http://gerrit.cloudera.org:8080/#/c/16842/10/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java@324 PS10, Line 324: for (SlotDescriptor slot: desc_.getSlots()) { You could add a new method numMaterializedSlots() to TupleDescriptor because it could be potentially used in other plan nodes. -- To view, visit http://gerrit.cloudera.org:8080/16842 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic01db168ff2c6d6de33ee553a8175599f035d7a1 Gerrit-Change-Number: 16842 Gerrit-PatchSet: 10 Gerrit-Owner: liuyao <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Bikramjeet Vig <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: liuyao <[email protected]> Gerrit-Comment-Date: Tue, 05 Jan 2021 02:24:33 +0000 Gerrit-HasComments: Yes
