[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-26 Thread Ankit Singhal (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524192#comment-16524192
 ] 

Ankit Singhal commented on PHOENIX-4666:


[~ortutay] , it seems re-sending of HashJoinCache is regressed by your change, 
if you fixed that then probably your fix will also work.

You can run all HashJoin test which sends cache again for the first time by 
making following change.
{code:java}
--public class HashJoinCacheIT extends BaseJoinIT {
++public class HashJoinCacheIT extends HashJoinGlobalIndexIT {

++ public HashJoinCacheIT(String[] indexDDL, String[] plans) {
++ super(indexDDL, plans);
++ }
++{code}
 

 

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-21 Thread Marcell Ortutay (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519619#comment-16519619
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Ok, let me look at those two approaches, they both are probably workable

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-21 Thread James Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519095#comment-16519095
 ] 

James Taylor commented on PHOENIX-4666:
---

Any chance you can make the generation of the hash join cache callable from 
where that exception is caught in TableResultIterator? In theory, isn't it a 
matter of executing a QueryPlan? Maybe you could push the RHS hash join cache 
plan into BaseResultIterators and then into TableResultIterator?

Not sure on the peeking result iterator, but it might work.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-20 Thread Marcell Ortutay (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518531#comment-16518531
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

> Can you throw this exception (or derive your 
> PersistentHashJoinCacheNotFoundException  from it)?

I don't think so. It's not really a matter of which exception to throw (I was 
throwing HashJoinCacheNotFoundException originally, it's easier to throw a 
separate exception), but rather that by the time we are throwing the exception, 
we are past the point in query execution where we can (easily) re-run the hash 
join cache generation.

One option *might* be to generate a peeking result iterator, and do a "peek()" 
on the first result for uses of persistent cache. Do you think this approach 
would work? [~jamestaylor] & co.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-19 Thread James Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517559#comment-16517559
 ] 

James Taylor commented on PHOENIX-4666:
---

[~an...@apache.org] may be able to advise. He was the one who figured out how 
to react to the HashJoinCacheNotFoundException by sending the HashCache to the 
server on which it couldn't be found (see TableResultIterator). Can you throw 
this exception (or derive your PersistentHashJoinCacheNotFoundException  from 
it)?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-19 Thread Marcell Ortutay (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517546#comment-16517546
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

[~jamestaylor] :

> Easiest would be to just let Phoenix rerun the portion of the query that 
> didn't find the hash cache. What's the reason this approach won't work?
> Otherwise, the top level method of BaseResultIterators managing the parallel 
> scans is submitWork(). Maybe you could do what you want from there? Or higher 
> up in the stack where the hash cache is being created?
 
I haven't been able to find a place to put the exception handler that works. 
Below is some additional info:
 
The hash join cache is generated in HashJoinPlan.iterator() method. This is the 
stack trace from that location:
{code:java}
org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:182)
org.apache.phoenix.execute.DelegateQueryPlan.iterator(DelegateQueryPlan.java:144)
org.apache.phoenix.execute.DelegateQueryPlan.iterator(DelegateQueryPlan.java:139)
org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:316)
org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:295)
org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:294)
org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:286)
org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1838)
sqlline.Commands.execute(Commands.java:822)
sqlline.Commands.sql(Commands.java:732)
sqlline.SqlLine.dispatch(SqlLine.java:813)
sqlline.SqlLine.begin(SqlLine.java:686)
sqlline.SqlLine.start(SqlLine.java:398)
sqlline.SqlLine.main(SqlLine.java:291){code}
 However, the actual execution of the query comes later. This is where 
PersistentHashJoinCacheNotFoundException is thrown:
{code:java}
org.apache.phoenix.coprocessor.PersistentHashJoinCacheNotFoundException: ERROR 
900 (HJ01): Hash Join cache not found joinId: 7. The cache might have expired 
and have been removed.
at org.apache.phoenix.util.ServerUtil.parseRemoteException(ServerUtil.java:189)
at 
org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(ServerUtil.java:174)
at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:141)
at 
org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1327)
at 
org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1245)
at 
org.apache.phoenix.iterate.RoundRobinResultIterator.getIterators(RoundRobinResultIterator.java:176)
at 
org.apache.phoenix.iterate.RoundRobinResultIterator.next(RoundRobinResultIterator.java:91)
at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:805)
at sqlline.BufferedRows.(BufferedRows.java:37)
at sqlline.SqlLine.print(SqlLine.java:1660)
at sqlline.Commands.execute(Commands.java:833)
at sqlline.Commands.sql(Commands.java:732)
at sqlline.SqlLine.dispatch(SqlLine.java:813)
at sqlline.SqlLine.begin(SqlLine.java:686)
at sqlline.SqlLine.start(SqlLine.java:398)
at sqlline.SqlLine.main(SqlLine.java:291)
{code}
Looking at this, the first common ancestor of these two stack traces is 
`sqlline.Commands.execute`, which is outside Phoenix codebase.

I took a quick look at the sqlline code base also. It appears that the flow for 
generating the hash join cache is started here: 
[https://github.com/julianhyde/sqlline/blob/master/src/main/java/sqlline/Commands.java#L823]
 and then the flow which triggers the exception is started here: 
[https://github.com/julianhyde/sqlline/blob/master/src/main/java/sqlline/Commands.java#L834]
 . 

I'm not sure if there's a way to do this doing flow control via exception, 
though I might be missing something. Please let me know. I've experimented with 
putting the exception handler in various places, including the 
ParallelIterators.submitWork() at 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/iterate/ParallelIterators.java#L135]
 , but it does not re-run the hash join cache generation, as that is in a 
previous part of the query execution.

Guidance/advice appreciated ; FWIW the RPC check version does work as expected 
and improves our query runs substantially.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516160#comment-16516160
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on the issue:

https://github.com/apache/phoenix/pull/298
  
@JamesRTaylor almost ready, we've been testing/deploying this internally 
and have a few small changes that I'll push out before it's ready to merge


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512467#comment-16512467
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user JamesRTaylor commented on the issue:

https://github.com/apache/phoenix/pull/298
  
Is this ready to go, @ortutay? It looks pretty clean now.


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-06-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512463#comment-16512463
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r195421295
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java 
---
@@ -1305,14 +1305,25 @@ public ScanWrapper(Scan scan) {
 Scan oldScan = scanPair.getFirst();
 byte[] startKey = 
oldScan.getAttribute(SCAN_ACTUAL_START_ROW);
 if(e2 instanceof 
HashJoinCacheNotFoundException){
+System.out.println("Handling 
HashJoinCacheNotFoundException");
--- End diff --

Minor nit: change to log message or remove System.out.println calls


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-22 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486507#comment-16486507
 ] 

James Taylor commented on PHOENIX-4666:
---

Easiest would be to just let Phoenix rerun the portion of the query that didn't 
find the hash cache. What's the reason this approach won't work?

Otherwise, the top level method of BaseResultIterators managing the parallel 
scans is submitWork(). Maybe you could do what you want from there? Or higher 
up in the stack where the hash cache is being created?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483241#comment-16483241
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

[~maryannxue] [~jamestaylor] I was hoping to get a bit of guidance on where/how 
to handle the exception. I tried adding an exception handler for 
HashJoinCacheNotFoundException in BaseResultIterators.java, as shown here: 
[https://github.com/apache/phoenix/commit/b336644a37f6c65524ee91a06a6859c0215b08f2#diff-8c3d3f644c66ef36d5bc604f017fabfcR1315]
 , but that doesn't seem to be correct. What I was hoping to do was to re-run 
the entire query with caching disabled for specific cache ID's using the 
override mechanism. What actually happens is, apparently, it tries to iterate 
again using the same query? I'm not entirely sure of this part of the code, but 
that is what seems to be happening.

Is there a good way / place to have it re-run the entire query with the the 
change to the StatementContext?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-19 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481490#comment-16481490
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Yea I can add a summary on the PR. Until then, answering the questions here:

(1) I'm not sure actually. it was there before I made any modification.

(2) The naming on this is a bit bad. But basically, my approach was this: on 
first pass, assume caches are present and run the query. Then catch the 
exception if they are not present. If they are not present, disable use of 
persistent cache for not available in StatementContext's 
"Map caches" map. Then, re-run the query 
expecting it to work. Let me know if this is unclear at all; I'll put it in the 
summary.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-18 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481382#comment-16481382
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Thank you for the update, [~ortutay]! I am looking at the PR now. It does look 
a lot cleaner than the last version. Still there's a couple of places I'm 
confused about, so if you could put a summary in your PR about how things work 
and what the workflow is like, it would be great. So my questions are around:

1) what is the parameter "storeCacheOnClient" for?

2) what is the use of "usePersistentCacheOverrides" in StatementContext?

To your HASH_BUILD_LEFT question:

First of all there is no force. In some cases, hash-join just cannot be used 
and it will default to sort-merge join. And we've implemented some cost-based 
query planning logic (currently not turned on by default), so it can be any of 
the three join strategies based on cost estimation. It is almost symmetric 
between HASH_BUILD_LEFT and HASH_BUILD_RIGHT, so you should make sure that your 
approach works for both.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478175#comment-16478175
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

[~maryannxue] I've updated the PR to respond to comments. All but one are 
addressed, please see my question in QueryCompiler.java regarding LHS join 
tables

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478169#comment-16478169
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on the issue:

https://github.com/apache/phoenix/pull/298
  
> Another question to consider besides the suggested changes: do we need a 
SQL command to remove a specific cache entry and/or purge the entire persistent 
cache?

I think that's a good idea but I would prefer to leave it for a future 
enhancement PR. There's a lot of improvements that can be made once people 
start using this and make feature requests. 


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478167#comment-16478167
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188783687
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java ---
@@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() {
 return memoryManager;
 }
 
-private Cache getServerCaches() {
+private Cache getServerCaches() {
 /* Delay creation of this map until it's needed */
 if (serverCaches == null) {
 synchronized(this) {
 if (serverCaches == null) {
-serverCaches = CacheBuilder.newBuilder()
-.expireAfterAccess(maxTimeToLiveMs, 
TimeUnit.MILLISECONDS)
-.ticker(getTicker())
-.removalListener(new 
RemovalListener(){
-@Override
-public void 
onRemoval(RemovalNotification notification) {
-
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
-}
-})
-.build();
+serverCaches = buildCache(maxTimeToLiveMs, false);
 }
 }
 }
 return serverCaches;
 }
+
+private Cache 
getPersistentServerCaches() {
+/* Delay creation of this map until it's needed */
+if (persistentServerCaches == null) {
+synchronized(this) {
+if (persistentServerCaches == null) {
+persistentServerCaches = 
buildCache(maxPersistenceTimeToLiveMs, true);
+}
+}
+}
+return persistentServerCaches;
+}
+
+private Cache buildCache(final int ttl, 
final boolean isPersistent) {
+return CacheBuilder.newBuilder()
+.expireAfterAccess(ttl, TimeUnit.MILLISECONDS)
+.ticker(getTicker())
+.removalListener(new RemovalListener(){
+@Override
+public void 
onRemoval(RemovalNotification notification) {
+   if (isPersistent || 
!notification.getValue().getUsePersistentCache()) {
+
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
--- End diff --

Fixed it here. FWIW some files mix tabs / spaces


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478166#comment-16478166
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188783472
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java ---
@@ -383,7 +384,7 @@ protected QueryPlan 
compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo
 new PTable[]{lhsTable}, new int[]{fieldPosition}, 
postJoinFilterExpression, QueryUtil.getOffsetLimit(limit, offset));
 Pair keyRangeExpressions = new 
Pair(null, null);
 getKeyExpressionCombinations(keyRangeExpressions, context, 
joinTable.getStatement(), rhsTableRef, type, joinExpressions, hashExpressions);
-return HashJoinPlan.create(joinTable.getStatement(), 
rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, 
hashExpressions, false, keyRangeExpressions.getFirst(), 
keyRangeExpressions.getSecond())});
+return HashJoinPlan.create(joinTable.getStatement(), 
rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, 
hashExpressions, false, false, keyRangeExpressions.getFirst(), 
keyRangeExpressions.getSecond())});
--- End diff --

@maryannxue I've changed the code here to use the `usePersistentCache` 
flag, but I'm not sure how to force a `HASH_BUILD_LEFT` strategy. The test I've 
put in the integration test file does not actually use that strategy. Do you 
know of an easy way to force a left table hash join?


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478142#comment-16478142
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188778970
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/memory/GlobalMemoryManager.java 
---
@@ -159,6 +161,10 @@ protected void finalize() throws Throwable {
 }
 
 private void freeMemory() {
+// System.out.println("Free memory! " + size);
--- End diff --

Done


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478140#comment-16478140
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188778824
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java ---
@@ -314,7 +314,8 @@ protected QueryPlan 
compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo
 if (i < count - 1) {
 fieldPositions[i + 1] = fieldPositions[i] + 
(tables[i] == null ? 0 : (tables[i].getColumns().size() - 
tables[i].getPKColumns().size()));
 }
-hashPlans[i] = new HashSubPlan(i, subPlans[i], 
optimized ? null : hashExpressions, joinSpec.isSingleValueOnly(), 
keyRangeLhsExpression, keyRangeRhsExpression);
+boolean usePersistentCache = 
joinTable.getStatement().getHint().hasHint(Hint.USE_PERSISTENT_CACHE);
--- End diff --

Done


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478124#comment-16478124
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188777276
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java ---
@@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() {
 return memoryManager;
 }
 
-private Cache getServerCaches() {
+private Cache getServerCaches() {
 /* Delay creation of this map until it's needed */
 if (serverCaches == null) {
 synchronized(this) {
 if (serverCaches == null) {
-serverCaches = CacheBuilder.newBuilder()
-.expireAfterAccess(maxTimeToLiveMs, 
TimeUnit.MILLISECONDS)
-.ticker(getTicker())
-.removalListener(new 
RemovalListener(){
-@Override
-public void 
onRemoval(RemovalNotification notification) {
-
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
-}
-})
-.build();
+serverCaches = buildCache(maxTimeToLiveMs, false);
 }
 }
 }
 return serverCaches;
 }
+
+private Cache 
getPersistentServerCaches() {
+/* Delay creation of this map until it's needed */
+if (persistentServerCaches == null) {
+synchronized(this) {
+if (persistentServerCaches == null) {
+persistentServerCaches = 
buildCache(maxPersistenceTimeToLiveMs, true);
+}
+}
+}
+return persistentServerCaches;
+}
+
+private Cache buildCache(final int ttl, 
final boolean isPersistent) {
+return CacheBuilder.newBuilder()
+.expireAfterAccess(ttl, TimeUnit.MILLISECONDS)
+.ticker(getTicker())
+.removalListener(new RemovalListener(){
+@Override
+public void 
onRemoval(RemovalNotification notification) {
+   if (isPersistent || 
!notification.getValue().getUsePersistentCache()) {
+
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
+   }
+}
+})
+.build();
+}
 
-@Override
+private void evictInactiveEntries(long bytesNeeded) {
+CacheEntry[] entries = 
getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{});
+Arrays.sort(entries);
+long available = this.getMemoryManager().getAvailableMemory();
+for (int i = 0; i < entries.length && available < bytesNeeded; 
i++) {
+CacheEntry entry = entries[i];
+if (!entry.isLive()) {
+   getServerCaches().invalidate(entry.getCacheId());
+   
getPersistentServerCaches().invalidate(entry.getCacheId());
+available = this.getMemoryManager().getAvailableMemory();
+}
+}
+}
+
+private CacheEntry maybeGet(ImmutableBytesPtr cacheId) {
+maybePromote(cacheId);
+CacheEntry entry = getServerCaches().getIfPresent(cacheId);
+return entry;
+}
+
+private void maybePromote(ImmutableBytesPtr cacheId) {
+CacheEntry entry = 
getPersistentServerCaches().getIfPresent(cacheId);
+if (entry == null) {
+return;
+}
+getServerCaches().put(cacheId, entry);
+}
+
+private void maybeDemote(ImmutableBytesPtr cacheId) {
+CacheEntry entry = getServerCaches().getIfPresent(cacheId);
+if (entry == null) {
+return;
+}
+entry.decrementLiveQueryCount();
+if (!entry.isLive()) {
+getServerCaches().invalidate(cacheId);
+}
+}
+
+public void debugPrintCaches() {
+   System.out.println("Live cache:" + getServerCaches());
+   for (ImmutableBytesPtr key : 
getServerCaches().asMap().keySet()) {
+   

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478123#comment-16478123
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188777161
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java ---
@@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() {
 return memoryManager;
 }
 
-private Cache getServerCaches() {
+private Cache getServerCaches() {
 /* Delay creation of this map until it's needed */
 if (serverCaches == null) {
 synchronized(this) {
 if (serverCaches == null) {
-serverCaches = CacheBuilder.newBuilder()
-.expireAfterAccess(maxTimeToLiveMs, 
TimeUnit.MILLISECONDS)
-.ticker(getTicker())
-.removalListener(new 
RemovalListener(){
-@Override
-public void 
onRemoval(RemovalNotification notification) {
-
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
-}
-})
-.build();
+serverCaches = buildCache(maxTimeToLiveMs, false);
 }
 }
 }
 return serverCaches;
 }
+
+private Cache 
getPersistentServerCaches() {
+/* Delay creation of this map until it's needed */
+if (persistentServerCaches == null) {
+synchronized(this) {
+if (persistentServerCaches == null) {
+persistentServerCaches = 
buildCache(maxPersistenceTimeToLiveMs, true);
+}
+}
+}
+return persistentServerCaches;
+}
+
+private Cache buildCache(final int ttl, 
final boolean isPersistent) {
+return CacheBuilder.newBuilder()
+.expireAfterAccess(ttl, TimeUnit.MILLISECONDS)
+.ticker(getTicker())
+.removalListener(new RemovalListener(){
+@Override
+public void 
onRemoval(RemovalNotification notification) {
+   if (isPersistent || 
!notification.getValue().getUsePersistentCache()) {
+
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
+   }
+}
+})
+.build();
+}
 
-@Override
+private void evictInactiveEntries(long bytesNeeded) {
+CacheEntry[] entries = 
getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{});
+Arrays.sort(entries);
+long available = this.getMemoryManager().getAvailableMemory();
+for (int i = 0; i < entries.length && available < bytesNeeded; 
i++) {
+CacheEntry entry = entries[i];
+if (!entry.isLive()) {
+   getServerCaches().invalidate(entry.getCacheId());
+   
getPersistentServerCaches().invalidate(entry.getCacheId());
+available = this.getMemoryManager().getAvailableMemory();
+}
+}
+}
+
+private CacheEntry maybeGet(ImmutableBytesPtr cacheId) {
--- End diff --

Done, also removed `maybeDemote`


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476586#comment-16476586
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user ortutay commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r188461941
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java ---
@@ -216,22 +234,146 @@ public void close() throws SQLException {
 }
 }
 }
-
+}
+
+public ServerCache checkServerCache(final byte[] cacheId, ScanRanges 
keyRanges, final TableRef cacheUsingTableRef,
--- End diff --

Yea makes sense. For this, I'm thinking of modifying the retry logic in 
BaseResultIterators to disable persistent caching on retries. I'm thinking I 
can add a get/set'er pair to the "context" that can be used for this


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470748#comment-16470748
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user JamesRTaylor commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r187393698
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java ---
@@ -216,22 +234,146 @@ public void close() throws SQLException {
 }
 }
 }
-
+}
+
+public ServerCache checkServerCache(final byte[] cacheId, ScanRanges 
keyRanges, final TableRef cacheUsingTableRef,
--- End diff --

FYI, we already have a mechanism for this. If the server throws a 
HashJoinCacheNotFoundException, then the client will react by sending the hash 
cache to the region servers that don't have it.


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-05-01 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460352#comment-16460352
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for review; working on deploying this internally with changes, will post 
revisions later this week or early next

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446189#comment-16446189
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183135634
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java ---
@@ -488,11 +502,20 @@ public ServerCache execute(HashJoinPlan parent) 
throws SQLException {
 if (hashExpressions != null) {
 ResultIterator iterator = plan.iterator();
 try {
-cache =
-parent.hashClient.addHashCache(ranges, 
iterator,
+final byte[] cacheId;
+if (usePersistentCache) {
--- End diff --

Like I said in an earlier comment, if using persistent cache, we don't need 
to call "check" or "add" cache here, instead we catch a specific Exception and 
addServerCache at this point. Still, we might wanna return "null" if this is 
persistent cache so later on the cache "close" method won't be called by the 
HashJoinPlan clean-up routine.


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446185#comment-16446185
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183128249
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java ---
@@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() {
 return memoryManager;
 }
 
-private Cache getServerCaches() {
+private Cache getServerCaches() {
 /* Delay creation of this map until it's needed */
 if (serverCaches == null) {
 synchronized(this) {
 if (serverCaches == null) {
-serverCaches = CacheBuilder.newBuilder()
-.expireAfterAccess(maxTimeToLiveMs, 
TimeUnit.MILLISECONDS)
-.ticker(getTicker())
-.removalListener(new 
RemovalListener(){
-@Override
-public void 
onRemoval(RemovalNotification notification) {
-
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
-}
-})
-.build();
+serverCaches = buildCache(maxTimeToLiveMs, false);
 }
 }
 }
 return serverCaches;
 }
+
+private Cache 
getPersistentServerCaches() {
+/* Delay creation of this map until it's needed */
+if (persistentServerCaches == null) {
+synchronized(this) {
+if (persistentServerCaches == null) {
+persistentServerCaches = 
buildCache(maxPersistenceTimeToLiveMs, true);
+}
+}
+}
+return persistentServerCaches;
+}
+
+private Cache buildCache(final int ttl, 
final boolean isPersistent) {
+return CacheBuilder.newBuilder()
+.expireAfterAccess(ttl, TimeUnit.MILLISECONDS)
+.ticker(getTicker())
+.removalListener(new RemovalListener(){
+@Override
+public void 
onRemoval(RemovalNotification notification) {
+   if (isPersistent || 
!notification.getValue().getUsePersistentCache()) {
+
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
--- End diff --

Looks like the indentation is not right here. Can you perform a check 
through the entire PR?


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446188#comment-16446188
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183136068
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/memory/GlobalMemoryManager.java 
---
@@ -159,6 +161,10 @@ protected void finalize() throws Throwable {
 }
 
 private void freeMemory() {
+// System.out.println("Free memory! " + size);
--- End diff --

I think this a mistake, missed from your code clean-up.


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446182#comment-16446182
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183126183
  
--- Diff: 
phoenix-core/src/it/java/org/apache/phoenix/end2end/join/HashJoinPersistentCacheIT.java
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.phoenix.end2end.join;
+
+import static org.apache.phoenix.util.TestUtil.TEST_PROPERTIES;
+import static org.junit.Assert.assertEquals;
+
+import java.sql.Connection;
+import java.sql.DriverManager;
+import java.sql.ResultSet;
+import java.util.Properties;
+
+import org.apache.phoenix.end2end.join.HashJoinCacheIT.InvalidateHashCache;
+import org.apache.phoenix.util.PropertiesUtil;
+import org.apache.phoenix.util.SchemaUtil;
+import org.apache.phoenix.util.TestUtil;
+import org.junit.Test;
+
+public class HashJoinPersistentCacheIT extends BaseJoinIT {
+
+@Override
+protected String getTableName(Connection conn, String virtualName) 
throws Exception {
+String realName = super.getTableName(conn, virtualName);
+TestUtil.addCoprocessor(conn, 
SchemaUtil.normalizeFullTableName(realName), InvalidateHashCache.class);
+return realName;
+}
+
+@Test
+public void testPersistentCache() throws Exception {
+  Properties props = PropertiesUtil.deepCopy(TEST_PROPERTIES);
+  Connection conn = DriverManager.getConnection(getUrl(), props);
+
+  createTestTable(getUrl(), "CREATE TABLE IF NOT EXISTS states (state 
CHAR(2) NOT NULL, name VARCHAR NOT NULL CONSTRAINT my_pk PRIMARY KEY (state, 
name))");
+  createTestTable(getUrl(), "CREATE TABLE IF NOT EXISTS cities (state 
CHAR(2) NOT NULL, city VARCHAR NOT NULL, population BIGINT CONSTRAINT my_pk 
PRIMARY KEY (state, city))");
+
+  conn.prepareStatement("UPSERT INTO states VALUES ('CA', 
'California')").executeUpdate();
+  conn.prepareStatement("UPSERT INTO states VALUES ('AZ', 
'Arizona')").executeUpdate();
+  conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'San 
Francisco', 5)").executeUpdate();
+  conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 
'Sacramento', 3000)").executeUpdate();
+  conn.prepareStatement("UPSERT INTO cities VALUES ('AZ', 'Phoenix', 
2)").executeUpdate();
+  conn.commit();
+
+  /* First, run query without using the persistent cache. This should 
return
+   * different results after an UPSERT takes place.
+   */
+  ResultSet rs = conn.prepareStatement("SELECT SUM(population) FROM 
states s JOIN cities c ON c.state = s.state").executeQuery();
+  rs.next();
+  int population1 = rs.getInt(1);
+
+  conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'Mt View', 
1500)").executeUpdate();
+  conn.commit();
+  rs = conn.prepareStatement("SELECT SUM(population) FROM states s 
JOIN cities c ON c.state = s.state").executeQuery();
+  rs.next();
+  int population2 = rs.getInt(1);
+  
+  assertEquals(73000, population1);
+  assertEquals(74500, population2);
+  
+  /* Second, run query using the persistent cache. This should return 
the
+   * same results after an UPSERT takes place.
+   */
+  rs = conn.prepareStatement("SELECT /*+ USE_PERSISTENT_CACHE */ 
SUM(population) FROM states s JOIN cities c ON c.state = 
s.state").executeQuery();
+  rs.next();
+  int population3 = rs.getInt(1);
+
+  conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'Palo Alto', 
2000)").executeUpdate();
+  conn.commit();
+  rs = conn.prepareStatement("SELECT /*+ USE_PERSISTENT_CACHE */ 
SUM(population) FROM states s JOIN 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446187#comment-16446187
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183128592
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java ---
@@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() {
 return memoryManager;
 }
 
-private Cache getServerCaches() {
+private Cache getServerCaches() {
 /* Delay creation of this map until it's needed */
 if (serverCaches == null) {
 synchronized(this) {
 if (serverCaches == null) {
-serverCaches = CacheBuilder.newBuilder()
-.expireAfterAccess(maxTimeToLiveMs, 
TimeUnit.MILLISECONDS)
-.ticker(getTicker())
-.removalListener(new 
RemovalListener(){
-@Override
-public void 
onRemoval(RemovalNotification notification) {
-
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
-}
-})
-.build();
+serverCaches = buildCache(maxTimeToLiveMs, false);
 }
 }
 }
 return serverCaches;
 }
+
+private Cache 
getPersistentServerCaches() {
+/* Delay creation of this map until it's needed */
+if (persistentServerCaches == null) {
+synchronized(this) {
+if (persistentServerCaches == null) {
+persistentServerCaches = 
buildCache(maxPersistenceTimeToLiveMs, true);
+}
+}
+}
+return persistentServerCaches;
+}
+
+private Cache buildCache(final int ttl, 
final boolean isPersistent) {
+return CacheBuilder.newBuilder()
+.expireAfterAccess(ttl, TimeUnit.MILLISECONDS)
+.ticker(getTicker())
+.removalListener(new RemovalListener(){
+@Override
+public void 
onRemoval(RemovalNotification notification) {
+   if (isPersistent || 
!notification.getValue().getUsePersistentCache()) {
+
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
+   }
+}
+})
+.build();
+}
 
-@Override
+private void evictInactiveEntries(long bytesNeeded) {
+CacheEntry[] entries = 
getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{});
+Arrays.sort(entries);
+long available = this.getMemoryManager().getAvailableMemory();
+for (int i = 0; i < entries.length && available < bytesNeeded; 
i++) {
+CacheEntry entry = entries[i];
+if (!entry.isLive()) {
+   getServerCaches().invalidate(entry.getCacheId());
+   
getPersistentServerCaches().invalidate(entry.getCacheId());
+available = this.getMemoryManager().getAvailableMemory();
+}
+}
+}
+
+private CacheEntry maybeGet(ImmutableBytesPtr cacheId) {
--- End diff --

Think we can just call it "get" or "getIfPresent" and merge 
{{maybePromote}} into this call.


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446184#comment-16446184
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183133375
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java ---
@@ -314,7 +314,8 @@ protected QueryPlan 
compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo
 if (i < count - 1) {
 fieldPositions[i + 1] = fieldPositions[i] + 
(tables[i] == null ? 0 : (tables[i].getColumns().size() - 
tables[i].getPKColumns().size()));
 }
-hashPlans[i] = new HashSubPlan(i, subPlans[i], 
optimized ? null : hashExpressions, joinSpec.isSingleValueOnly(), 
keyRangeLhsExpression, keyRangeRhsExpression);
+boolean usePersistentCache = 
joinTable.getStatement().getHint().hasHint(Hint.USE_PERSISTENT_CACHE);
--- End diff --

We can make "usePersistentCache" a member of QueryCompiler and initialize 
it in the beginning just like "noChildParentJoinOptimization".


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446186#comment-16446186
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183133702
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java ---
@@ -383,7 +384,7 @@ protected QueryPlan 
compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo
 new PTable[]{lhsTable}, new int[]{fieldPosition}, 
postJoinFilterExpression, QueryUtil.getOffsetLimit(limit, offset));
 Pair keyRangeExpressions = new 
Pair(null, null);
 getKeyExpressionCombinations(keyRangeExpressions, context, 
joinTable.getStatement(), rhsTableRef, type, joinExpressions, hashExpressions);
-return HashJoinPlan.create(joinTable.getStatement(), 
rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, 
hashExpressions, false, keyRangeExpressions.getFirst(), 
keyRangeExpressions.getSecond())});
+return HashJoinPlan.create(joinTable.getStatement(), 
rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, 
hashExpressions, false, false, keyRangeExpressions.getFirst(), 
keyRangeExpressions.getSecond())});
--- End diff --

This is also hash-join but with left table as build side. I think we should 
be able to use persistent cache as well. And could you also add another test to 
cover this point?


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446183#comment-16446183
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183130270
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java ---
@@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() {
 return memoryManager;
 }
 
-private Cache getServerCaches() {
+private Cache getServerCaches() {
 /* Delay creation of this map until it's needed */
 if (serverCaches == null) {
 synchronized(this) {
 if (serverCaches == null) {
-serverCaches = CacheBuilder.newBuilder()
-.expireAfterAccess(maxTimeToLiveMs, 
TimeUnit.MILLISECONDS)
-.ticker(getTicker())
-.removalListener(new 
RemovalListener(){
-@Override
-public void 
onRemoval(RemovalNotification notification) {
-
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
-}
-})
-.build();
+serverCaches = buildCache(maxTimeToLiveMs, false);
 }
 }
 }
 return serverCaches;
 }
+
+private Cache 
getPersistentServerCaches() {
+/* Delay creation of this map until it's needed */
+if (persistentServerCaches == null) {
+synchronized(this) {
+if (persistentServerCaches == null) {
+persistentServerCaches = 
buildCache(maxPersistenceTimeToLiveMs, true);
+}
+}
+}
+return persistentServerCaches;
+}
+
+private Cache buildCache(final int ttl, 
final boolean isPersistent) {
+return CacheBuilder.newBuilder()
+.expireAfterAccess(ttl, TimeUnit.MILLISECONDS)
+.ticker(getTicker())
+.removalListener(new RemovalListener(){
+@Override
+public void 
onRemoval(RemovalNotification notification) {
+   if (isPersistent || 
!notification.getValue().getUsePersistentCache()) {
+
Closeables.closeAllQuietly(Collections.singletonList(notification.getValue()));
+   }
+}
+})
+.build();
+}
 
-@Override
+private void evictInactiveEntries(long bytesNeeded) {
+CacheEntry[] entries = 
getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{});
+Arrays.sort(entries);
+long available = this.getMemoryManager().getAvailableMemory();
+for (int i = 0; i < entries.length && available < bytesNeeded; 
i++) {
+CacheEntry entry = entries[i];
+if (!entry.isLive()) {
+   getServerCaches().invalidate(entry.getCacheId());
+   
getPersistentServerCaches().invalidate(entry.getCacheId());
+available = this.getMemoryManager().getAvailableMemory();
+}
+}
+}
+
+private CacheEntry maybeGet(ImmutableBytesPtr cacheId) {
+maybePromote(cacheId);
+CacheEntry entry = getServerCaches().getIfPresent(cacheId);
+return entry;
+}
+
+private void maybePromote(ImmutableBytesPtr cacheId) {
+CacheEntry entry = 
getPersistentServerCaches().getIfPresent(cacheId);
+if (entry == null) {
+return;
+}
+getServerCaches().put(cacheId, entry);
+}
+
+private void maybeDemote(ImmutableBytesPtr cacheId) {
+CacheEntry entry = getServerCaches().getIfPresent(cacheId);
+if (entry == null) {
+return;
+}
+entry.decrementLiveQueryCount();
+if (!entry.isLive()) {
+getServerCaches().invalidate(cacheId);
+}
+}
+
+public void debugPrintCaches() {
+   System.out.println("Live cache:" + getServerCaches());
+   for (ImmutableBytesPtr key : 
getServerCaches().asMap().keySet()) {

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446181#comment-16446181
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

Github user maryannxue commented on a diff in the pull request:

https://github.com/apache/phoenix/pull/298#discussion_r183127442
  
--- Diff: 
phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java ---
@@ -216,22 +234,146 @@ public void close() throws SQLException {
 }
 }
 }
-
+}
+
+public ServerCache checkServerCache(final byte[] cacheId, ScanRanges 
keyRanges, final TableRef cacheUsingTableRef,
--- End diff --

I am thinking here, can we do this differently?
Instead of making RPC calls of "checkServerCache" for the first and every 
subsequent queries, we do NOT make any calls, neither "check" or "add" when the 
persistent-cache hint is available and catches the 
{{PersistentCacheNotFoundException}} on the first attempt (or later attempts if 
somehow the cache has been evicted) and then try adding the cache all over 
again. I think it will be more efficient in general.


> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-20 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446159#comment-16446159
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

thanks [~maryannxue]

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-19 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445048#comment-16445048
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Thank you for the PR, [~ortutay]! Looking at it now.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441554#comment-16441554
 ] 

ASF GitHub Bot commented on PHOENIX-4666:
-

GitHub user ortutay opened a pull request:

https://github.com/apache/phoenix/pull/298

PHOENIX-4666 Persistent subquery cache for hash joins



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ortutay/phoenix PHOENIX-4666-subquery-cache

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/phoenix/pull/298.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #298


commit 77460c37697b2a112cf6ed345356a16da08dd51c
Author: Marcell Ortutay 
Date:   2018-03-29T19:59:03Z

PHOENIX-4666 Persistent subquery cache for hash joins

commit d1fc310e3d0df772c0aeb1673d2b64d01f495d27
Author: Marcell Ortutay 
Date:   2018-04-17T20:51:00Z

PHOENIX-4666 Add tests for TenantCacheTest




> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438067#comment-16438067
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Sounds good! Please remember to add more tests.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438013#comment-16438013
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

FYI added hint-to-enable ("USE_PERSISTENT_CACHE") to my fork. This is the last 
"feature" that I think is needed for a v1. I'm going to clean things up a bit 
and refactor for the noChildParentJoinOptimization approach suggested above and 
then submit for a full review.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437901#comment-16437901
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Got it–for option (2) that would basically require calling setScanRange() with 
argument to scan "all"?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437881#comment-16437881
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Thank you for explaining 2! And now it's clear to me. It is actually an 
optimization implemented in PHOENIX-852. It doesn't not apply to all join 
queries, so I guess in the test case you have there this optimization is 
triggered. So I'm thinking two options here:
 1) Call {{HashCacheClient#evaluateKeyExpression()}} to get the key ranges if 
cache is already available on the server side, in which case 
{{CachedSubqueryResultIterator}} would still be needed but we do not add cache 
one more time. We can have a client-side cache for such key-range values as 
well. And if this is the first client building the cache for the first time, we 
get these values from calling {{addHashCache()}} and then cache them on the 
client side.
 2) A more radical but easier approach is to disable this "child-parent (FK-PK) 
join optimization" when using persistent cache. This makes some practical 
sense: if we can make a big performance gain from avoiding rebuilding the hash 
cache, it could indicate that the cache itself might be of some considerable 
side, and thus the key ranges generated from a relatively large amount of 
values might not be that useful to narrow down the scan anyway.
 For now, I actually prefer the second approach, since we can focus on the main 
part of this issue and move forward faster.

For 5: Just for simplicity. Feel like we can have less getters and "get" calls 
here.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437759#comment-16437759
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for taking a look [~maryannxue]. responses below:

> 1. First of all, I think it's important that we have an option to enable and 
>disable the persistent cache, making sure that users can still run join 
>queries in the default temp-cache way.

Yes definitely. In fact I am adding a hint, and for now I think it makes sense 
to only enable it if that hint is there, so we don't break any existing 
behavior.

> 2. Regarding to your change [2], can you explain what exactly is the problem 
>of key-range generation? Looks like checkCache() and addCache() are doing 
>redundant work, and CachedSubqueryResultIterator should be unnecessary. We do 
>not wish to read the cache on the client side and then re-add the cache again.

Yes in my first attempt at this I did not have the redundant work, but I ran 
into a bug where I was getting empty results when using the cached code path. 
If you look at ServerCache.addHashCache() you'll notice that it calls 
serialize(): 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L85]

serialize() produces that serialized RHS join cache and iterates over all the 
results in the ResultIterator for the RHS query. In this line 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131]
 it adds entries to keyRangeRhsValues. This is a list of the key values on the 
RHS of the join. It is used here 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java#L226]
 in HashJoinPlan as part of the query, and at some point it becomes an argument 
to set the scan range (I can dig up where if you'd like).

For this reason the cached code path somehow needs to generate correct values 
for keyRangeRhsValues, or correct values for the scan range, and these values 
need to be available on the client side.

The approach I did just re-runs the same codepath for both no-cache and cached 
queries. The advantage is it was fairly simple to implement and it guarantees 
identical execution. The downside is the redundant work. It would also be 
possible to have special case code to set the scan range for cached queries. 
This is a bit harder to implement but is more efficient.

Happy to hear what people think about this. Maybe there is something much 
simpler that I am missing!

> 3. We need to be aware that the string representation of the sub-query 
>statement is not reliable, which means the same join-tables or sub-queries do 
>not necessarily map to the same string representation, and thus will have 
>different generated cache-id. It'd be optimal if we can have some 
>normalization here. We can consider leaving this as a future improvement, yet 
>at this point we'd better have some test cases (counter cases as well) to 
>cover this point.

Yes, definitely. I'd prefer to leave this as a future improvement to keep the 
initial PR focused. IIRC I saw for some complex queries there is a "$1" or "$2" 
placeholder, which changes even across identical queries. There are probably 
more things like this, eg. "x=10 AND y=20" is the same as "y=20 AND x=10".

> 4. Is there a way for us to update the cache content if tables have been 
>updated? This might be related to what approach we take to add and re-validate 
>cache in (2).

Currently, no. I was thinking though that the user application can control 
invalidation using a hint, like this:

    /*+ CACHE_PERSISTENT('2018-01-01 12:00') */

The '2018-01-01 12:00' would be a suffix to whatever cacheId we generate, like 
this:

    cacheId = hash(cacheId + '2018-01-01 12:00')

which lets the application explicitly invalidate the cache when needed.

> 5. A rather minor point as it just occurred to me: Can we have CacheEntry 
>implement Closable?

Yes. Just so I know, what is the benefit of this?

And yes, apologies for the messy code. I'm fixing it up today and it should be 
ready for a more thorough review today or tomorrow.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used 

[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436978#comment-16436978
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Thank you very much for your work, [~ortutay]! Here's a few things on the 
high-level:
1. First of all, I think it's important that we have an option to enable and 
disable the persistent cache, making sure that users can still run join queries 
in the default temp-cache way.
2. Regarding to your change [2], can you explain what exactly is the problem of 
key-range generation? Looks like checkCache() and addCache() are doing 
redundant work, and CachedSubqueryResultIterator should be unnecessary. We do 
not wish to read the cache on the client side and then re-add the cache again.
3. We need to be aware that the string representation of the sub-query 
statement is not reliable, which means the same join-tables or sub-queries do 
not necessarily map to the same string representation, and thus will have 
different generated cache-id. It'd be optimal if we can have some normalization 
here. We can consider leaving this as a future improvement, yet at this point 
we'd better have some test cases (counter cases as well) to cover this point.
4. Is there a way for us to update the cache content if tables have been 
updated? This might be related to what approach we take to add and re-validate 
cache in (2).
5. A rather minor point as it just occurred to me: Can we have CacheEntry 
implement Closable?
Lastly, I understand that it's work in progress, but as we move on, could you 
please do a little clean-up so it would be easier for discussions and code 
reviews? For example, correct the indentation (make sure there's no tabs); 
instead of commenting out a line of code, can you just remove it; and get rid 
of all "system.out.println" or replace them with logging if necessary?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-11 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434855#comment-16434855
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Yes, sure.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-11 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434844#comment-16434844
 ] 

James Taylor commented on PHOENIX-4666:
---

Would you have any spare cycles to take a look, [~maryannxue]?

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-09 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431546#comment-16431546
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

A update on this: I've implemented a basic version of this that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache client side. This works in the single client use case but doesn't 
work if you have a cluster of PQS servers (which is our situation at 23andMe). 
So instead I replaced this with an RPC mechanism. The client will send an RPC 
to each region server, and check if the subquery results are available.
 # Originally I planned to only return a boolean in the RPC check. However, I 
ran into an issue. It turns out that the serialize() method is involved in the 
generation of key ranges that are used in the query [1]. This serialize() 
method is in the addHashCache() code path. In order to make sure this code is 
hit, I am creating a CachedSubqueryResultIterator which is passed to the 
addHashCache() code path. This ensures that all side effects, like the key 
range generation, are the same between cached / uncached code paths.

Would love to get feedback on this approach. For (2) there is an alternate 
approach that also caches the key ranges. This is more efficient but has the 
downside of needing specialized code.

Work still left to do is eviction logic, and hint to enable, and general 
cleanup/testing.

[1]https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-23 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, eg.: /*+ SUBQUERY_CACHE('2018-03-23') */ which can be used by the 
application to explicitly expire a cache, in case TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408984#comment-16408984
 ] 

James Taylor commented on PHOENIX-4666:
---

I like the simplicity of your design, [~ortutay], in using the hash as the 
cache ID. You could just assume that the cache is already available and react 
to the exception you get back by generating the cache if it's not. That way 
you'd need no mapping at all (and no central place to check if the cache ID 
maps to an existing cache). That flow is already there, but you'd need to add 
the logic to generate the cache as the current code assumes that it has already 
built the cache (i.e. this handles the situation in which a region splits and 
the new RS doesn't have the cache yet).

Some considerations:
- at a minimum, we could have a global config for the TTL of the cache when 
this feature is enabled (so that it'd be a different config than the standard 
TTL config).
- at the finest granularity, you could even create a new hint that specifies 
the TTL so you could specify it per query.
- we'd want to make it clear in documentation that the cache data would be 
stale once generated (until the TTL expires it).
- might consider having a new table level property on which this feature could 
be enabled (or a table-specific TTL could be specified)

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408720#comment-16408720
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

Thanks for the input [~maryannxue]. My current implementation is here: 
[https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in 
particular this is my hacky patch: 
[https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af]
 . Instead of a random cache ID it takes a hash of the query statement and uses 
that as the cache ID. Each Phoenix client maintains it's own memory of which 
cache IDs have already been executed (this is not ideal, but it was easy to 
implement this way).

If I'm understanding your proposal, the Phoenix client would attempt to use a 
cache ID with the expectation that it exists on region servers. The region 
server would throw an exception if the cache ID is not found, which indicates 
to Phoenix client that it should evaluate the subquery as usual.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408676#comment-16408676
 ] 

Maryann Xue commented on PHOENIX-4666:
--

Here's my initial thought:
 # As opposed to maintaining a map from a subquery to a persistent cache, we 
could have a map from "cache_id" to a persistent cache. What we are doing now 
for "cache_id" should be good enough - client-side-generated unique id.
 # We'll have a certain property to indicate the use of persistent cache, so 
that a HashJoinPlan will re-use the "cache_id" (after it being generated on the 
first call) without having to re-send and re-build the caches on region servers.
 # User will have to hold on to the compiled statement and re-execute it.
 # There's no need for a centralized cache management at this point. A region 
server will throw a specific Exception if it is unable to find the cache 
requested, then on the client side, the sub-query will be re-evaluated and the 
cache will broadcasted to all region servers again with the same cache-id.

Not sure if is similar to your version, but would be nice to see whatever you 
already have and we can go from there.

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408666#comment-16408666
 ] 

Marcell Ortutay commented on PHOENIX-4666:
--

As I mentioned above we’re working on a design proposal for this internally at 
23andMe, and there’s one big decision that I wanted to get feedback on.

There is currently “server cache” that is used by the hash join process in 
Phoenix. Hash join tables are broadcast to all region servers that need it, and 
the hash joining happens via coprocessor. This cache is deleted after the query 
ends.

My first thought for a persistent cache was to re-use the server cache, and 
extend the TTL and change the key (“cacheId”) generation. I implemented this as 
a hacky proof-of-concept and it worked quite well, the performance was much 
improved.

However, I’m wondering if a separate cache makes more sense. The current server 
cache has a different use case than a persistent cache, and as such it may be a 
good idea to separate the two.

Some ways in which they are different:

- A persistent cache performs eviction when there is no space left. The server 
cache raises an exception, and the user must do a merge sort join instead.

- Users may want to configure the two differently, eg. allocate more space for 
a persistent cache than the server cache, and set a higher TTL

- The server cache data must be available on all region servers doing the hash 
join. In contrast, the persistent cache only needs 1 copy of the data across 
the system (ie. across all region servers) until the data is needed. Doing this 
would be more space efficient, but result in more network transfer.

- You could in theory have a pluggable system for the persistent cache, eg. use 
memcache or something

 

That said, there are advantages to keeping it all in the server cache:

 

- Simpler implementation, does not add a new system to Phoenix

- Faster in the case that you get a cache hit, since there is no network 
transfer involved

 

Would love to get some feedback / opinions on this, thanks!

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)