Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-2805: Order filters based on selectivity and cost ......................................................................
Patch Set 1: (20 comments) Updated to reflect most of Marcel's comments. http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/AnalyticExpr.java File fe/src/main/java/com/cloudera/impala/analysis/AnalyticExpr.java: Line 789: // AnalyticExprs can never be subexpressions in a conjunct. > your comments in Expr.java about cost_ don't point out that this is only co Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/BinaryPredicate.java File fe/src/main/java/com/cloudera/impala/analysis/BinaryPredicate.java: Line 353: cost_ = getChildCosts() + BINARY_PREDICATE_COST; > differentiate based on type: cost of 1 for fixed-len types, something highe Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/CaseExpr.java File fe/src/main/java/com/cloudera/impala/analysis/CaseExpr.java: Line 354: > get rid of gratuitous blank lines Done Line 355: cost_ += getChild(i).getCost(); > for both cases, you should also add in the max of the then-cost Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/Expr.java File fe/src/main/java/com/cloudera/impala/analysis/Expr.java: Line 66: public final static int ARITHMETIC_OP_COST = 1; > instead of pointing this out in a reply to a review comment, point this out Done Line 67: public final static int BETWEEN_PREDICATE_COST = 1; > this is basically two binary predicates, so make it 2*binary_predicate_cost Because we replace the BetweenPredicate's children with a CompoundPredicate of two BinaryPredicates during analysis, the cost of performing the comparisons is already accounted for in the cost of the BetweenPredicate's children. This cost would reflect some additional work that executing the between takes beyond executing the child compound predicate - maybe there isn't any and executing a between has the same cost as its child compound predicate, in which case we might remove this, it just seemed reasonable when I was working on this that every Expr should add at least something to the cost of its children. Line 68: public final static int BINARY_PREDICATE_COST = 1; > a lot of these reflect the cost of an O(1) computation. might make more sen The benefit of having these separate is that it allows us to express constant factors, eg. your comment below that timestamp arithmetic is more expensive than integer arithmetic even though they're both O(1) operations. It looks a little pointless right now because we don't know what the constant factors are in any sort of precise way, so there's little differentiation, but if we were to start doing benchmarking or instruction counting to get these numbers they would probably all end up different. Line 69: public final static int CASE_COST = 5; > why? The general sense that a case is more complicated, and therefore more costly, than something like integer arithmetic. Of course, a case is going to be more expensive anyways since its cost is calculated as the sum of its WHERE clauses, so maybe this is unnecessary. Line 79: public final static int SLOT_REF_COST = 5; > that should be 1, this is just a pointer dereference. Done Line 80: public final static int TIMESTAMP_ARITHMETIC_COST = 1; > move all of these into the respective classes The advantage of grouping all of these costs here is that it makes it easier to track them all down someday if we start generating them periodically from a benchmark or something like that. I don't have a strong opinion either way, though. Line 184: // its subexpressions. > subexpressions -> children (is the commonly used term) The reason I'm not doing this during analysis, and instead just computing costs when they're needed, is because analysis doesn't seem to be called on every Expr that we're interested in here. One example is 'pre-analyzed' SlotRefs and CastExprs, though this is easy to get around by setting their cost in their constructor. Harder to fix is something like the BinaryPredicates that are created in HashJoinNode::init, which never have analyze called on them as far as I can tell. Line 185: protected int cost_; > too generic a name; evalCost_ is more specific Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/FunctionCallExpr.java File fe/src/main/java/com/cloudera/impala/analysis/FunctionCallExpr.java: Line 506: cost_ = getChildCosts() + FUNCTION_CALL_COST; > leave a todo to differentiate based on the specific function. we know that Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/InPredicate.java File fe/src/main/java/com/cloudera/impala/analysis/InPredicate.java: Line 229: cost_ = getChildCosts() + BINARY_PREDICATE_COST * (children_.size() - 1) + IN_COST; > why -1? The first child is the value that you compare each of the other children to, so there are at most children_.size() - 1 comparisons to be made. http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/LikePredicate.java File fe/src/main/java/com/cloudera/impala/analysis/LikePredicate.java: Line 152: cost_ = getChildCosts() + LIKE_COST; > leave todo: differentiate based on pattern Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/StringLiteral.java File fe/src/main/java/com/cloudera/impala/analysis/StringLiteral.java: Line 171: cost_ = value_.length(); > this should also be O(1), because this is really only the cost of materiali I agree that this is a more principled way of calculating costs, so I'll make the change, but I just want to note that in order to be fully accurate we'll have to allow udfs to specify their cost as a big-O complexity, and until we do udfs that take strings aren't differentiated from those taking numeric types in their cost. http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/planner/PlanNode.java File fe/src/main/java/com/cloudera/impala/planner/PlanNode.java: Line 409: conjuncts_ = orderConjuncts(conjuncts_); > do this right after assignConjuncts() (unless there's a specific reason to Done Line 640: if (conjuncts.size() <= 1) { > single line Done http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/test/java/com/cloudera/impala/planner/OrderConjunctsTest.java File fe/src/test/java/com/cloudera/impala/planner/OrderConjunctsTest.java: Line 29: public class OrderConjunctsTest { > instead of writing a unit test for this specific functionality, capture the Done Line 95: > remove gratuitous blank lines Done -- To view, visit http://gerrit.cloudera.org:8080/2598 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034 Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-HasComments: Yes
