Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-2805: Order filters based on selectivity and cost
......................................................................


Patch Set 1:

(20 comments)

Updated to reflect most of Marcel's comments.

http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/AnalyticExpr.java
File fe/src/main/java/com/cloudera/impala/analysis/AnalyticExpr.java:

Line 789:     // AnalyticExprs can never be subexpressions in a conjunct.
> your comments in Expr.java about cost_ don't point out that this is only co
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/BinaryPredicate.java
File fe/src/main/java/com/cloudera/impala/analysis/BinaryPredicate.java:

Line 353:     cost_ = getChildCosts() + BINARY_PREDICATE_COST;
> differentiate based on type: cost of 1 for fixed-len types, something highe
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/CaseExpr.java
File fe/src/main/java/com/cloudera/impala/analysis/CaseExpr.java:

Line 354: 
> get rid of gratuitous blank lines
Done


Line 355:         cost_ += getChild(i).getCost();
> for both cases, you should also add in the max of the then-cost
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/Expr.java
File fe/src/main/java/com/cloudera/impala/analysis/Expr.java:

Line 66:   public final static int ARITHMETIC_OP_COST = 1;
> instead of pointing this out in a reply to a review comment, point this out
Done


Line 67:   public final static int BETWEEN_PREDICATE_COST = 1;
> this is basically two binary predicates, so make it 2*binary_predicate_cost
Because we replace the BetweenPredicate's children with a CompoundPredicate of 
two BinaryPredicates during analysis, the cost of performing the comparisons is 
already accounted for in the cost of the BetweenPredicate's children.

This cost would reflect some additional work that executing the between takes 
beyond executing the child compound predicate - maybe there isn't any and 
executing a between has the same cost as its child compound predicate, in which 
case we might remove this, it just seemed reasonable when I was working on this 
that every Expr should add at least something to the cost of its children.


Line 68:   public final static int BINARY_PREDICATE_COST = 1;
> a lot of these reflect the cost of an O(1) computation. might make more sen
The benefit of having these separate is that it allows us to express constant 
factors, eg. your comment below that timestamp arithmetic is more expensive 
than integer arithmetic even though they're both O(1) operations.

It looks a little pointless right now because we don't know what the constant 
factors are in any sort of precise way, so there's little differentiation, but 
if we were to start doing benchmarking or instruction counting to get these 
numbers they would probably all end up different.


Line 69:   public final static int CASE_COST = 5;
> why?
The general sense that a case is more complicated, and therefore more costly, 
than something like integer arithmetic. Of course, a case is going to be more 
expensive anyways since its cost is calculated as the sum of its WHERE clauses, 
so maybe this is unnecessary.


Line 79:   public final static int SLOT_REF_COST = 5;
> that should be 1, this is just a pointer dereference.
Done


Line 80:   public final static int TIMESTAMP_ARITHMETIC_COST = 1;
> move all of these into the respective classes
The advantage of grouping all of these costs here is that it makes it easier to 
track them all down someday if we start generating them periodically from a 
benchmark or something like that. I don't have a strong opinion either way, 
though.


Line 184:   // its subexpressions.
> subexpressions -> children (is the commonly used term)
The reason I'm not doing this during analysis, and instead just computing costs 
when they're needed, is because analysis doesn't seem to be called on every 
Expr that we're interested in here.

One example is 'pre-analyzed' SlotRefs and CastExprs, though this is easy to 
get around by setting their cost in their constructor.

Harder to fix is something like the BinaryPredicates that are created in 
HashJoinNode::init, which never have analyze called on them as far as I can 
tell.


Line 185:   protected int cost_;
> too generic a name; evalCost_ is more specific
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/com/cloudera/impala/analysis/FunctionCallExpr.java:

Line 506:     cost_ = getChildCosts() + FUNCTION_CALL_COST;
> leave a todo to differentiate based on the specific function. we know that 
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/InPredicate.java
File fe/src/main/java/com/cloudera/impala/analysis/InPredicate.java:

Line 229:     cost_ = getChildCosts() + BINARY_PREDICATE_COST * 
(children_.size() - 1) + IN_COST;
> why -1?
The first child is the value that you compare each of the other children to, so 
there are at most children_.size() - 1 comparisons to be made.


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/LikePredicate.java
File fe/src/main/java/com/cloudera/impala/analysis/LikePredicate.java:

Line 152:     cost_ = getChildCosts() + LIKE_COST;
> leave todo: differentiate based on pattern
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/analysis/StringLiteral.java
File fe/src/main/java/com/cloudera/impala/analysis/StringLiteral.java:

Line 171:     cost_ = value_.length();
> this should also be O(1), because this is really only the cost of materiali
I agree that this is a more principled way of calculating costs, so I'll make 
the change, but I just want to note that in order to be fully accurate we'll 
have to allow udfs to specify their cost as a big-O complexity, and until we do 
udfs that take strings aren't differentiated from those taking numeric types in 
their cost.


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/main/java/com/cloudera/impala/planner/PlanNode.java
File fe/src/main/java/com/cloudera/impala/planner/PlanNode.java:

Line 409:     conjuncts_ = orderConjuncts(conjuncts_);
> do this right after assignConjuncts() (unless there's a specific reason to 
Done


Line 640:     if (conjuncts.size() <= 1) {
> single line
Done


http://gerrit.cloudera.org:8080/#/c/2598/1/fe/src/test/java/com/cloudera/impala/planner/OrderConjunctsTest.java
File fe/src/test/java/com/cloudera/impala/planner/OrderConjunctsTest.java:

Line 29: public class OrderConjunctsTest {
> instead of writing a unit test for this specific functionality, capture the
Done


Line 95: 
> remove gratuitous blank lines
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/2598
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02279a26fbc6308ac5eb819d78345fc010469034
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes

Reply via email to