[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16524192#comment-16524192 ] Ankit Singhal commented on PHOENIX-4666: [~ortutay] , it seems re-sending of HashJoinCache is regressed by your change, if you fixed that then probably your fix will also work. You can run all HashJoin test which sends cache again for the first time by making following change. {code:java} --public class HashJoinCacheIT extends BaseJoinIT { ++public class HashJoinCacheIT extends HashJoinGlobalIndexIT { ++ public HashJoinCacheIT(String[] indexDDL, String[] plans) { ++ super(indexDDL, plans); ++ } ++{code} > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519619#comment-16519619 ] Marcell Ortutay commented on PHOENIX-4666: -- Ok, let me look at those two approaches, they both are probably workable > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519095#comment-16519095 ] James Taylor commented on PHOENIX-4666: --- Any chance you can make the generation of the hash join cache callable from where that exception is caught in TableResultIterator? In theory, isn't it a matter of executing a QueryPlan? Maybe you could push the RHS hash join cache plan into BaseResultIterators and then into TableResultIterator? Not sure on the peeking result iterator, but it might work. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518531#comment-16518531 ] Marcell Ortutay commented on PHOENIX-4666: -- > Can you throw this exception (or derive your > PersistentHashJoinCacheNotFoundException from it)? I don't think so. It's not really a matter of which exception to throw (I was throwing HashJoinCacheNotFoundException originally, it's easier to throw a separate exception), but rather that by the time we are throwing the exception, we are past the point in query execution where we can (easily) re-run the hash join cache generation. One option *might* be to generate a peeking result iterator, and do a "peek()" on the first result for uses of persistent cache. Do you think this approach would work? [~jamestaylor] & co. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517559#comment-16517559 ] James Taylor commented on PHOENIX-4666: --- [~an...@apache.org] may be able to advise. He was the one who figured out how to react to the HashJoinCacheNotFoundException by sending the HashCache to the server on which it couldn't be found (see TableResultIterator). Can you throw this exception (or derive your PersistentHashJoinCacheNotFoundException from it)? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517546#comment-16517546 ] Marcell Ortutay commented on PHOENIX-4666: -- [~jamestaylor] : > Easiest would be to just let Phoenix rerun the portion of the query that > didn't find the hash cache. What's the reason this approach won't work? > Otherwise, the top level method of BaseResultIterators managing the parallel > scans is submitWork(). Maybe you could do what you want from there? Or higher > up in the stack where the hash cache is being created? I haven't been able to find a place to put the exception handler that works. Below is some additional info: The hash join cache is generated in HashJoinPlan.iterator() method. This is the stack trace from that location: {code:java} org.apache.phoenix.execute.HashJoinPlan.iterator(HashJoinPlan.java:182) org.apache.phoenix.execute.DelegateQueryPlan.iterator(DelegateQueryPlan.java:144) org.apache.phoenix.execute.DelegateQueryPlan.iterator(DelegateQueryPlan.java:139) org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:316) org.apache.phoenix.jdbc.PhoenixStatement$1.call(PhoenixStatement.java:295) org.apache.phoenix.call.CallRunner.run(CallRunner.java:53) org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:294) org.apache.phoenix.jdbc.PhoenixStatement.executeQuery(PhoenixStatement.java:286) org.apache.phoenix.jdbc.PhoenixStatement.execute(PhoenixStatement.java:1838) sqlline.Commands.execute(Commands.java:822) sqlline.Commands.sql(Commands.java:732) sqlline.SqlLine.dispatch(SqlLine.java:813) sqlline.SqlLine.begin(SqlLine.java:686) sqlline.SqlLine.start(SqlLine.java:398) sqlline.SqlLine.main(SqlLine.java:291){code} However, the actual execution of the query comes later. This is where PersistentHashJoinCacheNotFoundException is thrown: {code:java} org.apache.phoenix.coprocessor.PersistentHashJoinCacheNotFoundException: ERROR 900 (HJ01): Hash Join cache not found joinId: 7. The cache might have expired and have been removed. at org.apache.phoenix.util.ServerUtil.parseRemoteException(ServerUtil.java:189) at org.apache.phoenix.util.ServerUtil.parseServerExceptionOrNull(ServerUtil.java:174) at org.apache.phoenix.util.ServerUtil.parseServerException(ServerUtil.java:141) at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1327) at org.apache.phoenix.iterate.BaseResultIterators.getIterators(BaseResultIterators.java:1245) at org.apache.phoenix.iterate.RoundRobinResultIterator.getIterators(RoundRobinResultIterator.java:176) at org.apache.phoenix.iterate.RoundRobinResultIterator.next(RoundRobinResultIterator.java:91) at org.apache.phoenix.jdbc.PhoenixResultSet.next(PhoenixResultSet.java:805) at sqlline.BufferedRows.(BufferedRows.java:37) at sqlline.SqlLine.print(SqlLine.java:1660) at sqlline.Commands.execute(Commands.java:833) at sqlline.Commands.sql(Commands.java:732) at sqlline.SqlLine.dispatch(SqlLine.java:813) at sqlline.SqlLine.begin(SqlLine.java:686) at sqlline.SqlLine.start(SqlLine.java:398) at sqlline.SqlLine.main(SqlLine.java:291) {code} Looking at this, the first common ancestor of these two stack traces is `sqlline.Commands.execute`, which is outside Phoenix codebase. I took a quick look at the sqlline code base also. It appears that the flow for generating the hash join cache is started here: [https://github.com/julianhyde/sqlline/blob/master/src/main/java/sqlline/Commands.java#L823] and then the flow which triggers the exception is started here: [https://github.com/julianhyde/sqlline/blob/master/src/main/java/sqlline/Commands.java#L834] . I'm not sure if there's a way to do this doing flow control via exception, though I might be missing something. Please let me know. I've experimented with putting the exception handler in various places, including the ParallelIterators.submitWork() at [https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/iterate/ParallelIterators.java#L135] , but it does not re-run the hash join cache generation, as that is in a previous part of the query execution. Guidance/advice appreciated ; FWIW the RPC check version does work as expected and improves our query runs substantially. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516160#comment-16516160 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on the issue: https://github.com/apache/phoenix/pull/298 @JamesRTaylor almost ready, we've been testing/deploying this internally and have a few small changes that I'll push out before it's ready to merge > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512467#comment-16512467 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user JamesRTaylor commented on the issue: https://github.com/apache/phoenix/pull/298 Is this ready to go, @ortutay? It looks pretty clean now. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512463#comment-16512463 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user JamesRTaylor commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r195421295 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/iterate/BaseResultIterators.java --- @@ -1305,14 +1305,25 @@ public ScanWrapper(Scan scan) { Scan oldScan = scanPair.getFirst(); byte[] startKey = oldScan.getAttribute(SCAN_ACTUAL_START_ROW); if(e2 instanceof HashJoinCacheNotFoundException){ +System.out.println("Handling HashJoinCacheNotFoundException"); --- End diff -- Minor nit: change to log message or remove System.out.println calls > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486507#comment-16486507 ] James Taylor commented on PHOENIX-4666: --- Easiest would be to just let Phoenix rerun the portion of the query that didn't find the hash cache. What's the reason this approach won't work? Otherwise, the top level method of BaseResultIterators managing the parallel scans is submitWork(). Maybe you could do what you want from there? Or higher up in the stack where the hash cache is being created? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483241#comment-16483241 ] Marcell Ortutay commented on PHOENIX-4666: -- [~maryannxue] [~jamestaylor] I was hoping to get a bit of guidance on where/how to handle the exception. I tried adding an exception handler for HashJoinCacheNotFoundException in BaseResultIterators.java, as shown here: [https://github.com/apache/phoenix/commit/b336644a37f6c65524ee91a06a6859c0215b08f2#diff-8c3d3f644c66ef36d5bc604f017fabfcR1315] , but that doesn't seem to be correct. What I was hoping to do was to re-run the entire query with caching disabled for specific cache ID's using the override mechanism. What actually happens is, apparently, it tries to iterate again using the same query? I'm not entirely sure of this part of the code, but that is what seems to be happening. Is there a good way / place to have it re-run the entire query with the the change to the StatementContext? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481490#comment-16481490 ] Marcell Ortutay commented on PHOENIX-4666: -- Yea I can add a summary on the PR. Until then, answering the questions here: (1) I'm not sure actually. it was there before I made any modification. (2) The naming on this is a bit bad. But basically, my approach was this: on first pass, assume caches are present and run the query. Then catch the exception if they are not present. If they are not present, disable use of persistent cache for not available in StatementContext's "Mapcaches" map. Then, re-run the query expecting it to work. Let me know if this is unclear at all; I'll put it in the summary. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481382#comment-16481382 ] Maryann Xue commented on PHOENIX-4666: -- Thank you for the update, [~ortutay]! I am looking at the PR now. It does look a lot cleaner than the last version. Still there's a couple of places I'm confused about, so if you could put a summary in your PR about how things work and what the workflow is like, it would be great. So my questions are around: 1) what is the parameter "storeCacheOnClient" for? 2) what is the use of "usePersistentCacheOverrides" in StatementContext? To your HASH_BUILD_LEFT question: First of all there is no force. In some cases, hash-join just cannot be used and it will default to sort-merge join. And we've implemented some cost-based query planning logic (currently not turned on by default), so it can be any of the three join strategies based on cost estimation. It is almost symmetric between HASH_BUILD_LEFT and HASH_BUILD_RIGHT, so you should make sure that your approach works for both. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478175#comment-16478175 ] Marcell Ortutay commented on PHOENIX-4666: -- [~maryannxue] I've updated the PR to respond to comments. All but one are addressed, please see my question in QueryCompiler.java regarding LHS join tables > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478169#comment-16478169 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on the issue: https://github.com/apache/phoenix/pull/298 > Another question to consider besides the suggested changes: do we need a SQL command to remove a specific cache entry and/or purge the entire persistent cache? I think that's a good idea but I would prefer to leave it for a future enhancement PR. There's a lot of improvements that can be made once people start using this and make feature requests. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478167#comment-16478167 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188783687 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java --- @@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() { return memoryManager; } -private CachegetServerCaches() { +private Cache getServerCaches() { /* Delay creation of this map until it's needed */ if (serverCaches == null) { synchronized(this) { if (serverCaches == null) { -serverCaches = CacheBuilder.newBuilder() -.expireAfterAccess(maxTimeToLiveMs, TimeUnit.MILLISECONDS) -.ticker(getTicker()) -.removalListener(new RemovalListener (){ -@Override -public void onRemoval(RemovalNotification notification) { - Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); -} -}) -.build(); +serverCaches = buildCache(maxTimeToLiveMs, false); } } } return serverCaches; } + +private Cache getPersistentServerCaches() { +/* Delay creation of this map until it's needed */ +if (persistentServerCaches == null) { +synchronized(this) { +if (persistentServerCaches == null) { +persistentServerCaches = buildCache(maxPersistenceTimeToLiveMs, true); +} +} +} +return persistentServerCaches; +} + +private Cache buildCache(final int ttl, final boolean isPersistent) { +return CacheBuilder.newBuilder() +.expireAfterAccess(ttl, TimeUnit.MILLISECONDS) +.ticker(getTicker()) +.removalListener(new RemovalListener (){ +@Override +public void onRemoval(RemovalNotification notification) { + if (isPersistent || !notification.getValue().getUsePersistentCache()) { + Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); --- End diff -- Fixed it here. FWIW some files mix tabs / spaces > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478166#comment-16478166 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188783472 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java --- @@ -383,7 +384,7 @@ protected QueryPlan compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo new PTable[]{lhsTable}, new int[]{fieldPosition}, postJoinFilterExpression, QueryUtil.getOffsetLimit(limit, offset)); PairkeyRangeExpressions = new Pair (null, null); getKeyExpressionCombinations(keyRangeExpressions, context, joinTable.getStatement(), rhsTableRef, type, joinExpressions, hashExpressions); -return HashJoinPlan.create(joinTable.getStatement(), rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, hashExpressions, false, keyRangeExpressions.getFirst(), keyRangeExpressions.getSecond())}); +return HashJoinPlan.create(joinTable.getStatement(), rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, hashExpressions, false, false, keyRangeExpressions.getFirst(), keyRangeExpressions.getSecond())}); --- End diff -- @maryannxue I've changed the code here to use the `usePersistentCache` flag, but I'm not sure how to force a `HASH_BUILD_LEFT` strategy. The test I've put in the integration test file does not actually use that strategy. Do you know of an easy way to force a left table hash join? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478142#comment-16478142 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188778970 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/memory/GlobalMemoryManager.java --- @@ -159,6 +161,10 @@ protected void finalize() throws Throwable { } private void freeMemory() { +// System.out.println("Free memory! " + size); --- End diff -- Done > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478140#comment-16478140 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188778824 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java --- @@ -314,7 +314,8 @@ protected QueryPlan compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo if (i < count - 1) { fieldPositions[i + 1] = fieldPositions[i] + (tables[i] == null ? 0 : (tables[i].getColumns().size() - tables[i].getPKColumns().size())); } -hashPlans[i] = new HashSubPlan(i, subPlans[i], optimized ? null : hashExpressions, joinSpec.isSingleValueOnly(), keyRangeLhsExpression, keyRangeRhsExpression); +boolean usePersistentCache = joinTable.getStatement().getHint().hasHint(Hint.USE_PERSISTENT_CACHE); --- End diff -- Done > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478124#comment-16478124 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188777276 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java --- @@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() { return memoryManager; } -private CachegetServerCaches() { +private Cache getServerCaches() { /* Delay creation of this map until it's needed */ if (serverCaches == null) { synchronized(this) { if (serverCaches == null) { -serverCaches = CacheBuilder.newBuilder() -.expireAfterAccess(maxTimeToLiveMs, TimeUnit.MILLISECONDS) -.ticker(getTicker()) -.removalListener(new RemovalListener (){ -@Override -public void onRemoval(RemovalNotification notification) { - Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); -} -}) -.build(); +serverCaches = buildCache(maxTimeToLiveMs, false); } } } return serverCaches; } + +private Cache getPersistentServerCaches() { +/* Delay creation of this map until it's needed */ +if (persistentServerCaches == null) { +synchronized(this) { +if (persistentServerCaches == null) { +persistentServerCaches = buildCache(maxPersistenceTimeToLiveMs, true); +} +} +} +return persistentServerCaches; +} + +private Cache buildCache(final int ttl, final boolean isPersistent) { +return CacheBuilder.newBuilder() +.expireAfterAccess(ttl, TimeUnit.MILLISECONDS) +.ticker(getTicker()) +.removalListener(new RemovalListener (){ +@Override +public void onRemoval(RemovalNotification notification) { + if (isPersistent || !notification.getValue().getUsePersistentCache()) { + Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); + } +} +}) +.build(); +} -@Override +private void evictInactiveEntries(long bytesNeeded) { +CacheEntry[] entries = getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{}); +Arrays.sort(entries); +long available = this.getMemoryManager().getAvailableMemory(); +for (int i = 0; i < entries.length && available < bytesNeeded; i++) { +CacheEntry entry = entries[i]; +if (!entry.isLive()) { + getServerCaches().invalidate(entry.getCacheId()); + getPersistentServerCaches().invalidate(entry.getCacheId()); +available = this.getMemoryManager().getAvailableMemory(); +} +} +} + +private CacheEntry maybeGet(ImmutableBytesPtr cacheId) { +maybePromote(cacheId); +CacheEntry entry = getServerCaches().getIfPresent(cacheId); +return entry; +} + +private void maybePromote(ImmutableBytesPtr cacheId) { +CacheEntry entry = getPersistentServerCaches().getIfPresent(cacheId); +if (entry == null) { +return; +} +getServerCaches().put(cacheId, entry); +} + +private void maybeDemote(ImmutableBytesPtr cacheId) { +CacheEntry entry = getServerCaches().getIfPresent(cacheId); +if (entry == null) { +return; +} +entry.decrementLiveQueryCount(); +if (!entry.isLive()) { +getServerCaches().invalidate(cacheId); +} +} + +public void debugPrintCaches() { + System.out.println("Live cache:" + getServerCaches()); + for (ImmutableBytesPtr key : getServerCaches().asMap().keySet()) { +
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478123#comment-16478123 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188777161 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java --- @@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() { return memoryManager; } -private CachegetServerCaches() { +private Cache getServerCaches() { /* Delay creation of this map until it's needed */ if (serverCaches == null) { synchronized(this) { if (serverCaches == null) { -serverCaches = CacheBuilder.newBuilder() -.expireAfterAccess(maxTimeToLiveMs, TimeUnit.MILLISECONDS) -.ticker(getTicker()) -.removalListener(new RemovalListener (){ -@Override -public void onRemoval(RemovalNotification notification) { - Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); -} -}) -.build(); +serverCaches = buildCache(maxTimeToLiveMs, false); } } } return serverCaches; } + +private Cache getPersistentServerCaches() { +/* Delay creation of this map until it's needed */ +if (persistentServerCaches == null) { +synchronized(this) { +if (persistentServerCaches == null) { +persistentServerCaches = buildCache(maxPersistenceTimeToLiveMs, true); +} +} +} +return persistentServerCaches; +} + +private Cache buildCache(final int ttl, final boolean isPersistent) { +return CacheBuilder.newBuilder() +.expireAfterAccess(ttl, TimeUnit.MILLISECONDS) +.ticker(getTicker()) +.removalListener(new RemovalListener (){ +@Override +public void onRemoval(RemovalNotification notification) { + if (isPersistent || !notification.getValue().getUsePersistentCache()) { + Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); + } +} +}) +.build(); +} -@Override +private void evictInactiveEntries(long bytesNeeded) { +CacheEntry[] entries = getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{}); +Arrays.sort(entries); +long available = this.getMemoryManager().getAvailableMemory(); +for (int i = 0; i < entries.length && available < bytesNeeded; i++) { +CacheEntry entry = entries[i]; +if (!entry.isLive()) { + getServerCaches().invalidate(entry.getCacheId()); + getPersistentServerCaches().invalidate(entry.getCacheId()); +available = this.getMemoryManager().getAvailableMemory(); +} +} +} + +private CacheEntry maybeGet(ImmutableBytesPtr cacheId) { --- End diff -- Done, also removed `maybeDemote` > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476586#comment-16476586 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user ortutay commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r188461941 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java --- @@ -216,22 +234,146 @@ public void close() throws SQLException { } } } - +} + +public ServerCache checkServerCache(final byte[] cacheId, ScanRanges keyRanges, final TableRef cacheUsingTableRef, --- End diff -- Yea makes sense. For this, I'm thinking of modifying the retry logic in BaseResultIterators to disable persistent caching on retries. I'm thinking I can add a get/set'er pair to the "context" that can be used for this > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470748#comment-16470748 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user JamesRTaylor commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r187393698 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java --- @@ -216,22 +234,146 @@ public void close() throws SQLException { } } } - +} + +public ServerCache checkServerCache(final byte[] cacheId, ScanRanges keyRanges, final TableRef cacheUsingTableRef, --- End diff -- FYI, we already have a mechanism for this. If the server throws a HashJoinCacheNotFoundException, then the client will react by sending the hash cache to the region servers that don't have it. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460352#comment-16460352 ] Marcell Ortutay commented on PHOENIX-4666: -- Thanks for review; working on deploying this internally with changes, will post revisions later this week or early next > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446189#comment-16446189 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183135634 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java --- @@ -488,11 +502,20 @@ public ServerCache execute(HashJoinPlan parent) throws SQLException { if (hashExpressions != null) { ResultIterator iterator = plan.iterator(); try { -cache = -parent.hashClient.addHashCache(ranges, iterator, +final byte[] cacheId; +if (usePersistentCache) { --- End diff -- Like I said in an earlier comment, if using persistent cache, we don't need to call "check" or "add" cache here, instead we catch a specific Exception and addServerCache at this point. Still, we might wanna return "null" if this is persistent cache so later on the cache "close" method won't be called by the HashJoinPlan clean-up routine. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446185#comment-16446185 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183128249 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java --- @@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() { return memoryManager; } -private CachegetServerCaches() { +private Cache getServerCaches() { /* Delay creation of this map until it's needed */ if (serverCaches == null) { synchronized(this) { if (serverCaches == null) { -serverCaches = CacheBuilder.newBuilder() -.expireAfterAccess(maxTimeToLiveMs, TimeUnit.MILLISECONDS) -.ticker(getTicker()) -.removalListener(new RemovalListener (){ -@Override -public void onRemoval(RemovalNotification notification) { - Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); -} -}) -.build(); +serverCaches = buildCache(maxTimeToLiveMs, false); } } } return serverCaches; } + +private Cache getPersistentServerCaches() { +/* Delay creation of this map until it's needed */ +if (persistentServerCaches == null) { +synchronized(this) { +if (persistentServerCaches == null) { +persistentServerCaches = buildCache(maxPersistenceTimeToLiveMs, true); +} +} +} +return persistentServerCaches; +} + +private Cache buildCache(final int ttl, final boolean isPersistent) { +return CacheBuilder.newBuilder() +.expireAfterAccess(ttl, TimeUnit.MILLISECONDS) +.ticker(getTicker()) +.removalListener(new RemovalListener (){ +@Override +public void onRemoval(RemovalNotification notification) { + if (isPersistent || !notification.getValue().getUsePersistentCache()) { + Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); --- End diff -- Looks like the indentation is not right here. Can you perform a check through the entire PR? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446188#comment-16446188 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183136068 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/memory/GlobalMemoryManager.java --- @@ -159,6 +161,10 @@ protected void finalize() throws Throwable { } private void freeMemory() { +// System.out.println("Free memory! " + size); --- End diff -- I think this a mistake, missed from your code clean-up. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446182#comment-16446182 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183126183 --- Diff: phoenix-core/src/it/java/org/apache/phoenix/end2end/join/HashJoinPersistentCacheIT.java --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.phoenix.end2end.join; + +import static org.apache.phoenix.util.TestUtil.TEST_PROPERTIES; +import static org.junit.Assert.assertEquals; + +import java.sql.Connection; +import java.sql.DriverManager; +import java.sql.ResultSet; +import java.util.Properties; + +import org.apache.phoenix.end2end.join.HashJoinCacheIT.InvalidateHashCache; +import org.apache.phoenix.util.PropertiesUtil; +import org.apache.phoenix.util.SchemaUtil; +import org.apache.phoenix.util.TestUtil; +import org.junit.Test; + +public class HashJoinPersistentCacheIT extends BaseJoinIT { + +@Override +protected String getTableName(Connection conn, String virtualName) throws Exception { +String realName = super.getTableName(conn, virtualName); +TestUtil.addCoprocessor(conn, SchemaUtil.normalizeFullTableName(realName), InvalidateHashCache.class); +return realName; +} + +@Test +public void testPersistentCache() throws Exception { + Properties props = PropertiesUtil.deepCopy(TEST_PROPERTIES); + Connection conn = DriverManager.getConnection(getUrl(), props); + + createTestTable(getUrl(), "CREATE TABLE IF NOT EXISTS states (state CHAR(2) NOT NULL, name VARCHAR NOT NULL CONSTRAINT my_pk PRIMARY KEY (state, name))"); + createTestTable(getUrl(), "CREATE TABLE IF NOT EXISTS cities (state CHAR(2) NOT NULL, city VARCHAR NOT NULL, population BIGINT CONSTRAINT my_pk PRIMARY KEY (state, city))"); + + conn.prepareStatement("UPSERT INTO states VALUES ('CA', 'California')").executeUpdate(); + conn.prepareStatement("UPSERT INTO states VALUES ('AZ', 'Arizona')").executeUpdate(); + conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'San Francisco', 5)").executeUpdate(); + conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'Sacramento', 3000)").executeUpdate(); + conn.prepareStatement("UPSERT INTO cities VALUES ('AZ', 'Phoenix', 2)").executeUpdate(); + conn.commit(); + + /* First, run query without using the persistent cache. This should return + * different results after an UPSERT takes place. + */ + ResultSet rs = conn.prepareStatement("SELECT SUM(population) FROM states s JOIN cities c ON c.state = s.state").executeQuery(); + rs.next(); + int population1 = rs.getInt(1); + + conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'Mt View', 1500)").executeUpdate(); + conn.commit(); + rs = conn.prepareStatement("SELECT SUM(population) FROM states s JOIN cities c ON c.state = s.state").executeQuery(); + rs.next(); + int population2 = rs.getInt(1); + + assertEquals(73000, population1); + assertEquals(74500, population2); + + /* Second, run query using the persistent cache. This should return the + * same results after an UPSERT takes place. + */ + rs = conn.prepareStatement("SELECT /*+ USE_PERSISTENT_CACHE */ SUM(population) FROM states s JOIN cities c ON c.state = s.state").executeQuery(); + rs.next(); + int population3 = rs.getInt(1); + + conn.prepareStatement("UPSERT INTO cities VALUES ('CA', 'Palo Alto', 2000)").executeUpdate(); + conn.commit(); + rs = conn.prepareStatement("SELECT /*+ USE_PERSISTENT_CACHE */ SUM(population) FROM states s JOIN
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446187#comment-16446187 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183128592 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java --- @@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() { return memoryManager; } -private CachegetServerCaches() { +private Cache getServerCaches() { /* Delay creation of this map until it's needed */ if (serverCaches == null) { synchronized(this) { if (serverCaches == null) { -serverCaches = CacheBuilder.newBuilder() -.expireAfterAccess(maxTimeToLiveMs, TimeUnit.MILLISECONDS) -.ticker(getTicker()) -.removalListener(new RemovalListener (){ -@Override -public void onRemoval(RemovalNotification notification) { - Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); -} -}) -.build(); +serverCaches = buildCache(maxTimeToLiveMs, false); } } } return serverCaches; } + +private Cache getPersistentServerCaches() { +/* Delay creation of this map until it's needed */ +if (persistentServerCaches == null) { +synchronized(this) { +if (persistentServerCaches == null) { +persistentServerCaches = buildCache(maxPersistenceTimeToLiveMs, true); +} +} +} +return persistentServerCaches; +} + +private Cache buildCache(final int ttl, final boolean isPersistent) { +return CacheBuilder.newBuilder() +.expireAfterAccess(ttl, TimeUnit.MILLISECONDS) +.ticker(getTicker()) +.removalListener(new RemovalListener (){ +@Override +public void onRemoval(RemovalNotification notification) { + if (isPersistent || !notification.getValue().getUsePersistentCache()) { + Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); + } +} +}) +.build(); +} -@Override +private void evictInactiveEntries(long bytesNeeded) { +CacheEntry[] entries = getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{}); +Arrays.sort(entries); +long available = this.getMemoryManager().getAvailableMemory(); +for (int i = 0; i < entries.length && available < bytesNeeded; i++) { +CacheEntry entry = entries[i]; +if (!entry.isLive()) { + getServerCaches().invalidate(entry.getCacheId()); + getPersistentServerCaches().invalidate(entry.getCacheId()); +available = this.getMemoryManager().getAvailableMemory(); +} +} +} + +private CacheEntry maybeGet(ImmutableBytesPtr cacheId) { --- End diff -- Think we can just call it "get" or "getIfPresent" and merge {{maybePromote}} into this call. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446184#comment-16446184 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183133375 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java --- @@ -314,7 +314,8 @@ protected QueryPlan compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo if (i < count - 1) { fieldPositions[i + 1] = fieldPositions[i] + (tables[i] == null ? 0 : (tables[i].getColumns().size() - tables[i].getPKColumns().size())); } -hashPlans[i] = new HashSubPlan(i, subPlans[i], optimized ? null : hashExpressions, joinSpec.isSingleValueOnly(), keyRangeLhsExpression, keyRangeRhsExpression); +boolean usePersistentCache = joinTable.getStatement().getHint().hasHint(Hint.USE_PERSISTENT_CACHE); --- End diff -- We can make "usePersistentCache" a member of QueryCompiler and initialize it in the beginning just like "noChildParentJoinOptimization". > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446186#comment-16446186 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183133702 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/compile/QueryCompiler.java --- @@ -383,7 +384,7 @@ protected QueryPlan compileJoinQuery(JoinCompiler.Strategy strategy, StatementCo new PTable[]{lhsTable}, new int[]{fieldPosition}, postJoinFilterExpression, QueryUtil.getOffsetLimit(limit, offset)); PairkeyRangeExpressions = new Pair (null, null); getKeyExpressionCombinations(keyRangeExpressions, context, joinTable.getStatement(), rhsTableRef, type, joinExpressions, hashExpressions); -return HashJoinPlan.create(joinTable.getStatement(), rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, hashExpressions, false, keyRangeExpressions.getFirst(), keyRangeExpressions.getSecond())}); +return HashJoinPlan.create(joinTable.getStatement(), rhsPlan, joinInfo, new HashSubPlan[]{new HashSubPlan(0, lhsPlan, hashExpressions, false, false, keyRangeExpressions.getFirst(), keyRangeExpressions.getSecond())}); --- End diff -- This is also hash-join but with left table as build side. I think we should be able to use persistent cache as well. And could you also add another test to cover this point? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446183#comment-16446183 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183130270 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/TenantCacheImpl.java --- @@ -77,57 +153,164 @@ public MemoryManager getMemoryManager() { return memoryManager; } -private CachegetServerCaches() { +private Cache getServerCaches() { /* Delay creation of this map until it's needed */ if (serverCaches == null) { synchronized(this) { if (serverCaches == null) { -serverCaches = CacheBuilder.newBuilder() -.expireAfterAccess(maxTimeToLiveMs, TimeUnit.MILLISECONDS) -.ticker(getTicker()) -.removalListener(new RemovalListener (){ -@Override -public void onRemoval(RemovalNotification notification) { - Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); -} -}) -.build(); +serverCaches = buildCache(maxTimeToLiveMs, false); } } } return serverCaches; } + +private Cache getPersistentServerCaches() { +/* Delay creation of this map until it's needed */ +if (persistentServerCaches == null) { +synchronized(this) { +if (persistentServerCaches == null) { +persistentServerCaches = buildCache(maxPersistenceTimeToLiveMs, true); +} +} +} +return persistentServerCaches; +} + +private Cache buildCache(final int ttl, final boolean isPersistent) { +return CacheBuilder.newBuilder() +.expireAfterAccess(ttl, TimeUnit.MILLISECONDS) +.ticker(getTicker()) +.removalListener(new RemovalListener (){ +@Override +public void onRemoval(RemovalNotification notification) { + if (isPersistent || !notification.getValue().getUsePersistentCache()) { + Closeables.closeAllQuietly(Collections.singletonList(notification.getValue())); + } +} +}) +.build(); +} -@Override +private void evictInactiveEntries(long bytesNeeded) { +CacheEntry[] entries = getPersistentServerCaches().asMap().values().toArray(new CacheEntry[]{}); +Arrays.sort(entries); +long available = this.getMemoryManager().getAvailableMemory(); +for (int i = 0; i < entries.length && available < bytesNeeded; i++) { +CacheEntry entry = entries[i]; +if (!entry.isLive()) { + getServerCaches().invalidate(entry.getCacheId()); + getPersistentServerCaches().invalidate(entry.getCacheId()); +available = this.getMemoryManager().getAvailableMemory(); +} +} +} + +private CacheEntry maybeGet(ImmutableBytesPtr cacheId) { +maybePromote(cacheId); +CacheEntry entry = getServerCaches().getIfPresent(cacheId); +return entry; +} + +private void maybePromote(ImmutableBytesPtr cacheId) { +CacheEntry entry = getPersistentServerCaches().getIfPresent(cacheId); +if (entry == null) { +return; +} +getServerCaches().put(cacheId, entry); +} + +private void maybeDemote(ImmutableBytesPtr cacheId) { +CacheEntry entry = getServerCaches().getIfPresent(cacheId); +if (entry == null) { +return; +} +entry.decrementLiveQueryCount(); +if (!entry.isLive()) { +getServerCaches().invalidate(cacheId); +} +} + +public void debugPrintCaches() { + System.out.println("Live cache:" + getServerCaches()); + for (ImmutableBytesPtr key : getServerCaches().asMap().keySet()) {
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446181#comment-16446181 ] ASF GitHub Bot commented on PHOENIX-4666: - Github user maryannxue commented on a diff in the pull request: https://github.com/apache/phoenix/pull/298#discussion_r183127442 --- Diff: phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java --- @@ -216,22 +234,146 @@ public void close() throws SQLException { } } } - +} + +public ServerCache checkServerCache(final byte[] cacheId, ScanRanges keyRanges, final TableRef cacheUsingTableRef, --- End diff -- I am thinking here, can we do this differently? Instead of making RPC calls of "checkServerCache" for the first and every subsequent queries, we do NOT make any calls, neither "check" or "add" when the persistent-cache hint is available and catches the {{PersistentCacheNotFoundException}} on the first attempt (or later attempts if somehow the cache has been evicted) and then try adding the cache all over again. I think it will be more efficient in general. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446159#comment-16446159 ] Marcell Ortutay commented on PHOENIX-4666: -- thanks [~maryannxue] > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445048#comment-16445048 ] Maryann Xue commented on PHOENIX-4666: -- Thank you for the PR, [~ortutay]! Looking at it now. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441554#comment-16441554 ] ASF GitHub Bot commented on PHOENIX-4666: - GitHub user ortutay opened a pull request: https://github.com/apache/phoenix/pull/298 PHOENIX-4666 Persistent subquery cache for hash joins You can merge this pull request into a Git repository by running: $ git pull https://github.com/ortutay/phoenix PHOENIX-4666-subquery-cache Alternatively you can review and apply these changes as the patch at: https://github.com/apache/phoenix/pull/298.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #298 commit 77460c37697b2a112cf6ed345356a16da08dd51c Author: Marcell OrtutayDate: 2018-03-29T19:59:03Z PHOENIX-4666 Persistent subquery cache for hash joins commit d1fc310e3d0df772c0aeb1673d2b64d01f495d27 Author: Marcell Ortutay Date: 2018-04-17T20:51:00Z PHOENIX-4666 Add tests for TenantCacheTest > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438067#comment-16438067 ] Maryann Xue commented on PHOENIX-4666: -- Sounds good! Please remember to add more tests. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438013#comment-16438013 ] Marcell Ortutay commented on PHOENIX-4666: -- FYI added hint-to-enable ("USE_PERSISTENT_CACHE") to my fork. This is the last "feature" that I think is needed for a v1. I'm going to clean things up a bit and refactor for the noChildParentJoinOptimization approach suggested above and then submit for a full review. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437901#comment-16437901 ] Marcell Ortutay commented on PHOENIX-4666: -- Got it–for option (2) that would basically require calling setScanRange() with argument to scan "all"? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437881#comment-16437881 ] Maryann Xue commented on PHOENIX-4666: -- Thank you for explaining 2! And now it's clear to me. It is actually an optimization implemented in PHOENIX-852. It doesn't not apply to all join queries, so I guess in the test case you have there this optimization is triggered. So I'm thinking two options here: 1) Call {{HashCacheClient#evaluateKeyExpression()}} to get the key ranges if cache is already available on the server side, in which case {{CachedSubqueryResultIterator}} would still be needed but we do not add cache one more time. We can have a client-side cache for such key-range values as well. And if this is the first client building the cache for the first time, we get these values from calling {{addHashCache()}} and then cache them on the client side. 2) A more radical but easier approach is to disable this "child-parent (FK-PK) join optimization" when using persistent cache. This makes some practical sense: if we can make a big performance gain from avoiding rebuilding the hash cache, it could indicate that the cache itself might be of some considerable side, and thus the key ranges generated from a relatively large amount of values might not be that useful to narrow down the scan anyway. For now, I actually prefer the second approach, since we can focus on the main part of this issue and move forward faster. For 5: Just for simplicity. Feel like we can have less getters and "get" calls here. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437759#comment-16437759 ] Marcell Ortutay commented on PHOENIX-4666: -- Thanks for taking a look [~maryannxue]. responses below: > 1. First of all, I think it's important that we have an option to enable and >disable the persistent cache, making sure that users can still run join >queries in the default temp-cache way. Yes definitely. In fact I am adding a hint, and for now I think it makes sense to only enable it if that hint is there, so we don't break any existing behavior. > 2. Regarding to your change [2], can you explain what exactly is the problem >of key-range generation? Looks like checkCache() and addCache() are doing >redundant work, and CachedSubqueryResultIterator should be unnecessary. We do >not wish to read the cache on the client side and then re-add the cache again. Yes in my first attempt at this I did not have the redundant work, but I ran into a bug where I was getting empty results when using the cached code path. If you look at ServerCache.addHashCache() you'll notice that it calls serialize(): [https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L85] serialize() produces that serialized RHS join cache and iterates over all the results in the ResultIterator for the RHS query. In this line [https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131] it adds entries to keyRangeRhsValues. This is a list of the key values on the RHS of the join. It is used here [https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java#L226] in HashJoinPlan as part of the query, and at some point it becomes an argument to set the scan range (I can dig up where if you'd like). For this reason the cached code path somehow needs to generate correct values for keyRangeRhsValues, or correct values for the scan range, and these values need to be available on the client side. The approach I did just re-runs the same codepath for both no-cache and cached queries. The advantage is it was fairly simple to implement and it guarantees identical execution. The downside is the redundant work. It would also be possible to have special case code to set the scan range for cached queries. This is a bit harder to implement but is more efficient. Happy to hear what people think about this. Maybe there is something much simpler that I am missing! > 3. We need to be aware that the string representation of the sub-query >statement is not reliable, which means the same join-tables or sub-queries do >not necessarily map to the same string representation, and thus will have >different generated cache-id. It'd be optimal if we can have some >normalization here. We can consider leaving this as a future improvement, yet >at this point we'd better have some test cases (counter cases as well) to >cover this point. Yes, definitely. I'd prefer to leave this as a future improvement to keep the initial PR focused. IIRC I saw for some complex queries there is a "$1" or "$2" placeholder, which changes even across identical queries. There are probably more things like this, eg. "x=10 AND y=20" is the same as "y=20 AND x=10". > 4. Is there a way for us to update the cache content if tables have been >updated? This might be related to what approach we take to add and re-validate >cache in (2). Currently, no. I was thinking though that the user application can control invalidation using a hint, like this: /*+ CACHE_PERSISTENT('2018-01-01 12:00') */ The '2018-01-01 12:00' would be a suffix to whatever cacheId we generate, like this: cacheId = hash(cacheId + '2018-01-01 12:00') which lets the application explicitly invalidate the cache when needed. > 5. A rather minor point as it just occurred to me: Can we have CacheEntry >implement Closable? Yes. Just so I know, what is the benefit of this? And yes, apologies for the messy code. I'm fixing it up today and it should be ready for a more thorough review today or tomorrow. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436978#comment-16436978 ] Maryann Xue commented on PHOENIX-4666: -- Thank you very much for your work, [~ortutay]! Here's a few things on the high-level: 1. First of all, I think it's important that we have an option to enable and disable the persistent cache, making sure that users can still run join queries in the default temp-cache way. 2. Regarding to your change [2], can you explain what exactly is the problem of key-range generation? Looks like checkCache() and addCache() are doing redundant work, and CachedSubqueryResultIterator should be unnecessary. We do not wish to read the cache on the client side and then re-add the cache again. 3. We need to be aware that the string representation of the sub-query statement is not reliable, which means the same join-tables or sub-queries do not necessarily map to the same string representation, and thus will have different generated cache-id. It'd be optimal if we can have some normalization here. We can consider leaving this as a future improvement, yet at this point we'd better have some test cases (counter cases as well) to cover this point. 4. Is there a way for us to update the cache content if tables have been updated? This might be related to what approach we take to add and re-validate cache in (2). 5. A rather minor point as it just occurred to me: Can we have CacheEntry implement Closable? Lastly, I understand that it's work in progress, but as we move on, could you please do a little clean-up so it would be easier for discussions and code reviews? For example, correct the indentation (make sure there's no tabs); instead of commenting out a line of code, can you just remove it; and get rid of all "system.out.println" or replace them with logging if necessary? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434855#comment-16434855 ] Maryann Xue commented on PHOENIX-4666: -- Yes, sure. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434844#comment-16434844 ] James Taylor commented on PHOENIX-4666: --- Would you have any spare cycles to take a look, [~maryannxue]? > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431546#comment-16431546 ] Marcell Ortutay commented on PHOENIX-4666: -- A update on this: I've implemented a basic version of this that re-uses the RHS results in a subquery cache. I made a few changes to the original hacky implementation that I wanted to get some feedback on. My code is here: [https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please note this is a work in progress. I've changed the following things: # In my first implementation, I stored a mapping of subquery hash -> ServerCache client side. This works in the single client use case but doesn't work if you have a cluster of PQS servers (which is our situation at 23andMe). So instead I replaced this with an RPC mechanism. The client will send an RPC to each region server, and check if the subquery results are available. # Originally I planned to only return a boolean in the RPC check. However, I ran into an issue. It turns out that the serialize() method is involved in the generation of key ranges that are used in the query [1]. This serialize() method is in the addHashCache() code path. In order to make sure this code is hit, I am creating a CachedSubqueryResultIterator which is passed to the addHashCache() code path. This ensures that all side effects, like the key range generation, are the same between cached / uncached code paths. Would love to get feedback on this approach. For (2) there is an alternate approach that also caches the key ranges. This is more efficient but has the downside of needing specialized code. Work still left to do is eviction logic, and hint to enable, and general cleanup/testing. [1]https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131 > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221 ] Marcell Ortutay commented on PHOENIX-4666: -- Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly simple, and it can be expanded in follow-up patches. To start, here is what I would propose: # Use existing server cache, with option to keep around data past a single query. There would be a new "keep around TTL" that sets the max time an entry is kept around. The inter-query data may be evicted if space is needed. Data being used for a "live" query is track as such, and is never evicted (keep current Exception behavior) # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only activated if this hint is present. This hint also has an optional cache key suffix, eg.: /*+ SUBQUERY_CACHE('2018-03-23') */ which can be used by the application to explicitly expire a cache, in case TTL does not give enough control # Cache eviction uses some sort of priority queue / LRU type system. Simple ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry Things that will be left for future work: # Additional config/control around when to use subquery cache, eg. global control, or a table level control, or table timestamp based controls # Use of Apache Arrow for serialization (instead of existing HashCacheClient.serialize() method) # Persistent cache separate from HBase coprocessor system I'm going to start work on this next week, and hopefully will have a patch by end of the week for initial review > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Assignee: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408984#comment-16408984 ] James Taylor commented on PHOENIX-4666: --- I like the simplicity of your design, [~ortutay], in using the hash as the cache ID. You could just assume that the cache is already available and react to the exception you get back by generating the cache if it's not. That way you'd need no mapping at all (and no central place to check if the cache ID maps to an existing cache). That flow is already there, but you'd need to add the logic to generate the cache as the current code assumes that it has already built the cache (i.e. this handles the situation in which a region splits and the new RS doesn't have the cache yet). Some considerations: - at a minimum, we could have a global config for the TTL of the cache when this feature is enabled (so that it'd be a different config than the standard TTL config). - at the finest granularity, you could even create a new hint that specifies the TTL so you could specify it per query. - we'd want to make it clear in documentation that the cache data would be stale once generated (until the TTL expires it). - might consider having a new table level property on which this feature could be enabled (or a table-specific TTL could be specified) > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408720#comment-16408720 ] Marcell Ortutay commented on PHOENIX-4666: -- Thanks for the input [~maryannxue]. My current implementation is here: [https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in particular this is my hacky patch: [https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af] . Instead of a random cache ID it takes a hash of the query statement and uses that as the cache ID. Each Phoenix client maintains it's own memory of which cache IDs have already been executed (this is not ideal, but it was easy to implement this way). If I'm understanding your proposal, the Phoenix client would attempt to use a cache ID with the expectation that it exists on region servers. The region server would throw an exception if the cache ID is not found, which indicates to Phoenix client that it should evaluate the subquery as usual. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408676#comment-16408676 ] Maryann Xue commented on PHOENIX-4666: -- Here's my initial thought: # As opposed to maintaining a map from a subquery to a persistent cache, we could have a map from "cache_id" to a persistent cache. What we are doing now for "cache_id" should be good enough - client-side-generated unique id. # We'll have a certain property to indicate the use of persistent cache, so that a HashJoinPlan will re-use the "cache_id" (after it being generated on the first call) without having to re-send and re-build the caches on region servers. # User will have to hold on to the compiled statement and re-execute it. # There's no need for a centralized cache management at this point. A region server will throw a specific Exception if it is unable to find the cache requested, then on the client side, the sub-query will be re-evaluated and the cache will broadcasted to all region servers again with the same cache-id. Not sure if is similar to your version, but would be nice to see whatever you already have and we can go from there. > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408666#comment-16408666 ] Marcell Ortutay commented on PHOENIX-4666: -- As I mentioned above we’re working on a design proposal for this internally at 23andMe, and there’s one big decision that I wanted to get feedback on. There is currently “server cache” that is used by the hash join process in Phoenix. Hash join tables are broadcast to all region servers that need it, and the hash joining happens via coprocessor. This cache is deleted after the query ends. My first thought for a persistent cache was to re-use the server cache, and extend the TTL and change the key (“cacheId”) generation. I implemented this as a hacky proof-of-concept and it worked quite well, the performance was much improved. However, I’m wondering if a separate cache makes more sense. The current server cache has a different use case than a persistent cache, and as such it may be a good idea to separate the two. Some ways in which they are different: - A persistent cache performs eviction when there is no space left. The server cache raises an exception, and the user must do a merge sort join instead. - Users may want to configure the two differently, eg. allocate more space for a persistent cache than the server cache, and set a higher TTL - The server cache data must be available on all region servers doing the hash join. In contrast, the persistent cache only needs 1 copy of the data across the system (ie. across all region servers) until the data is needed. Doing this would be more space efficient, but result in more network transfer. - You could in theory have a pluggable system for the persistent cache, eg. use memcache or something That said, there are advantages to keeping it all in the server cache: - Simpler implementation, does not add a new system to Phoenix - Faster in the case that you get a cache hit, since there is no network transfer involved Would love to get some feedback / opinions on this, thanks! > Add a subquery cache that persists beyond the life of a query > - > > Key: PHOENIX-4666 > URL: https://issues.apache.org/jira/browse/PHOENIX-4666 > Project: Phoenix > Issue Type: Improvement >Reporter: Marcell Ortutay >Priority: Major > > The user list thread for additional context is here: > [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E] > > A Phoenix query may contain expensive subqueries, and moreover those > expensive subqueries may be used across multiple different queries. While > whole result caching is possible at the application level, it is not possible > to cache subresults in the application. This can cause bad performance for > queries in which the subquery is the most expensive part of the query, and > the application is powerless to do anything at the query level. It would be > good if Phoenix provided a way to cache subquery results, as it would provide > a significant performance gain. > An illustrative example: > SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) > expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = > \{id} > In this case, the subquery "expensive_result" is expensive to compute, but it > doesn't change between queries. The rest of the query does because of the > \{id} parameter. This means the application can't cache it, but it would be > good if there was a way to cache expensive_result. > Note that there is currently a coprocessor based "server cache", but the data > in this "cache" is not persisted across queries. It is deleted after a TTL > expires (30sec by default), or when the query completes. > This is issue is fairly high priority for us at 23andMe and we'd be happy to > provide a patch with some guidance from Phoenix maintainers. We are currently > putting together a design document for a solution, and we'll post it to this > Jira ticket for review in a few days. -- This message was sent by Atlassian JIRA (v7.6.3#76005)