[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-03-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398601#comment-16398601
 ] 

Hudson commented on PHOENIX-1556:
-

FAILURE: Integrated in Jenkins build Phoenix-4.x-HBase-0.98 #1833 (See 
[https://builds.apache.org/job/Phoenix-4.x-HBase-0.98/1833/])
PHOENIX-1556 Base hash versus sort merge join decision on cost (maryannxue: rev 
6914d54d99b4fafae44d1a3397c44ba6e5d10368)
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/SortMergeJoinPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/CorrelatePlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/LiteralResultIterationPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/compile/TraceQueryPlan.java
* (edit) 
phoenix-core/src/it/java/org/apache/phoenix/end2end/CostBasedDecisionIT.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/util/CostUtil.java
* (edit) 
phoenix-core/src/test/java/org/apache/phoenix/query/ParallelIteratorsSplitTest.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/ClientProcessingPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/compile/QueryPlan.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/ByteCountVisitor.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixStatement.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/UnnestArrayPlan.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/QueryPlanVisitor.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/execute/UnionPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/ClientScanPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/ClientAggregatePlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/TupleProjectionPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/execute/ScanPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/CursorFetchPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/AggregatePlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/compile/ListJarsQueryPlan.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/AvgRowWidthVisitor.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/RowCountVisitor.java


> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Fix For: 4.14.0, 5.0.0
>
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-03-13 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398099#comment-16398099
 ] 

Maryann Xue commented on PHOENIX-1556:
--

Done. All 5.x and 4.x branches have been synced with my latest commits.

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Fix For: 4.14.0, 5.0.0
>
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-03-09 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16393819#comment-16393819
 ] 

James Taylor commented on PHOENIX-1556:
---

Looks like this wasn't committed to some of the active branches: 4.x-HBase-0.98 
and 4.x-HBase-1.1. Our active branches are 4.x-HBase-0.98, 4.x-HBase-1.1, 
4.x-HBase-1.2, 4.x-cdh5.11.2, 4.x-HBase-1.3, master, and 5.x-HBase-2.0. Would 
you mind committed this to any of those branches you missed please, 
[~maryannxue]?

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Fix For: 4.14.0, 5.0.0
>
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-19 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369777#comment-16369777
 ] 

Maryann Xue commented on PHOENIX-1556:
--

Thank you, [~rajeshbabu], for reminding me! 5.x branch has now caught up with 
all my recent check-ins.

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Fix For: 4.14.0
>
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-19 Thread Rajeshbabu Chintaguntla (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369018#comment-16369018
 ] 

Rajeshbabu Chintaguntla commented on PHOENIX-1556:
--

[~maryannxue]
I think this is missing in 5.x-HBase-2.0 branch. Can you please commit this to 
the branch as well. Thanks.

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Fix For: 4.14.0
>
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361630#comment-16361630
 ] 

Hudson commented on PHOENIX-1556:
-

FAILURE: Integrated in Jenkins build Phoenix-master #1934 (See 
[https://builds.apache.org/job/Phoenix-master/1934/])
PHOENIX-1556 Base hash versus sort merge join decision on cost (maryannxue: rev 
9461d0d6a299bb3cbcf53905d7e9b73895a99299)
* (edit) 
phoenix-core/src/it/java/org/apache/phoenix/end2end/CostBasedDecisionIT.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/util/CostUtil.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/compile/QueryPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/SortMergeJoinPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/execute/UnionPlan.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/AvgRowWidthVisitor.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/RowCountVisitor.java
* (edit) 
phoenix-core/src/test/java/org/apache/phoenix/query/ParallelIteratorsSplitTest.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/jdbc/PhoenixStatement.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/ByteCountVisitor.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/CorrelatePlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/LiteralResultIterationPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/execute/ScanPlan.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/execute/visitor/QueryPlanVisitor.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/AggregatePlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/ClientProcessingPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/compile/ListJarsQueryPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/compile/JoinCompiler.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/ClientAggregatePlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/UnnestArrayPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/ClientScanPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/TupleProjectionPlan.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/compile/TraceQueryPlan.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/execute/CursorFetchPlan.java


> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-12 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16361456#comment-16361456
 ] 

Maryann Xue commented on PHOENIX-1556:
--

bq. One thing with PageFilter is that it represents the limit pushed down to 
the server. Since the limit cannot always be pushed down (depending on the 
query - for example an aggregate query can push down the limit only if it's 
aggregating on the leading part of the pk), should we consider that? Or do you 
think we can reliably get the limit that's pushed to the server from the query 
plan?

Thank you for this good reminder! Looks like PageFilter is used in two 
occasions (let me know if that's not correct):
1. Set a per scan limit, in which case the total number of bytes scanned will 
remain the same, and the cost should be about the same.
2. Push down a limit to the server side for a ScanPlan. The "limit" itself 
should be taken into account when calculating cost and byte/row numbers. For 
example, an order-by with limit and without limit actually cost differently. 
For a ScanPlan, whether it is able to "stop" early on the server side or it 
depends on the a client-side iterator to stop the scan should not make a 
difference in decision making at this point.
That said, your question just reminded me that there's still a little to be 
done regarding reflecting "limit" in costs:
1. Scan without order-by but with limit is not costed accurately.
2. Limit pushed down to server-side in a hash-join is not reflected in the cost 
yet.

There's also another interesting optimization we can do:
For outer joins, we can push limit down to the "preserved" side. We've already 
done that for hash-joins, while for sort-merge-joins, it is slightly more 
complicated as we'd like to push it further down into the child plan.

I'll open new JIRAs for these improvement tasks. Thanks again for the nice 
input, [~jamestaylor]!

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-09 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359066#comment-16359066
 ] 

James Taylor commented on PHOENIX-1556:
---

bq. Yes. stripSkipScanFilter() also aims to eliminate things like PageFilter 
and looks to keep only boolean expression filters that cannot be pushed into PK.
One thing with PageFilter is that it represents the limit pushed down to the 
server. Since the limit cannot always be pushed down (depending on the query - 
for example an aggregate query can push down the limit only if it's aggregating 
on the leading part of the pk), should we consider that? Or do you think we can 
reliably get the limit that's pushed to the server from the query plan?

bq. A probably more realistic approach here might be to set a configurable 
"limit" for specific operators
That's a good idea. I'll file a JIRA and copy/paste your explanation there.

+1 to the patch (assuming tests pass locally for you -- FYI test with the 
4.x-HBase-1.3 branch as there are test failures in master). Great work!

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-09 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359048#comment-16359048
 ] 

Maryann Xue commented on PHOENIX-1556:
--

{quote}Should UNION_DISTINCT_FACTOR be 1.0 since we only support UNION ALL 
currently?
{quote}
Since we only support "all", this block won't take effect at all, which means 
the UNION ALL row count will just be the sum of its children's row count.
{quote}What's the reasoning behind stripSkipScanFilter? Is that removed because 
it's effect is already incorporated into the bytes scanned estimate?
{quote}
Yes. {{stripSkipScanFilter()}} also aims to eliminate things like PageFilter 
and looks to keep only boolean expression filters that cannot be pushed into PK.
{quote}Should RowCountVisitor have a method for distinct? In particular, 
there's an optimization we have when doing a distinct on the leading PK columns 
which impacts cost. This optimization is not identified until runtime, so we 
might need to tweak the code so we know about it at compile time. This could be 
done in a separate patch.
{quote}
Thank you for pointing this out! I'll open another JIRA and dig into that.
{quote}Somewhat orthogonal to your pull (but maybe building on top of it), do 
you think it'd be possible to prevent a query from running that's "too 
expensive" (assuming "too expensive" would be identified by a config property)?
{quote}
Maybe. But users should be well aware that the costs are not accurate and they 
do not correspond to a certain amount of time. The absolute value of the cost 
doesn't make so much sense as the difference between the values of alternative 
plans generated from the same query. Besides, consider a QueryPlan consisting 
of a mix of operators, each of which has a different weight in cost evaluation, 
so it would be hard for users to figure out a proper configuration. A probably 
more realistic approach here might be to set a configurable "limit" for 
specific operators. For example, we know that some queries timeout during 
sorting if the dataset is too large, so when calculating the cost for order-by 
(or sometimes client-side order-by), we'd just know. Another example is how we 
handle hash-joins right now: when it's over the limit, we just say it's too 
expensive (represented by the "highest" cost).

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-1556) Base hash versus sort merge join decision on cost

2018-02-09 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358851#comment-16358851
 ] 

James Taylor commented on PHOENIX-1556:
---

Wow, this is really awesome, [~maryannxue]. I love the tests. A couple of 
questions:
- Should UNION_DISTINCT_FACTOR be 1.0 since we only support UNION ALL currently?
{code}
+if (!all) {
+rows *= UNION_DISTINCT_FACTOR;
+}
{code}
- What's the reasoning behind stripSkipScanFilter? Is that removed because it's 
effect is already incorporated into the bytes scanned estimate?
- Should RowCountVisitor have a method for distinct? In particular, there's an 
optimization we have when doing a distinct on the leading PK columns which 
impacts cost. This optimization is not identified until runtime, so we might 
need to tweak the code so we know about it at compile time. This could be done 
in a separate patch.
- Somewhat orthogonal to your pull (but maybe building on top of it), do you 
think it'd be possible to prevent a query from running that's "too expensive" 
(assuming "too expensive" would be identified by a config property)? Something 
to keep in mind - I can file a separate JIRA for this.

> Base hash versus sort merge join decision on cost
> -
>
> Key: PHOENIX-1556
> URL: https://issues.apache.org/jira/browse/PHOENIX-1556
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Maryann Xue
>Priority: Major
>  Labels: CostBasedOptimization
> Attachments: PHOENIX-1556.patch
>
>
> At compile time, we know how many guideposts (i.e. how many bytes) will be 
> scanned for the RHS table. We should, by default, base the decision of using 
> the hash-join verus many-to-many join on this information.
> Another criteria (as we've seen in PHOENIX-4508) is whether or not the tables 
> being joined are already ordered by the join key. In that case, it's better 
> to always use the sort merge join.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)