[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437921#comment-16437921 ] Marcell Ortutay edited comment on PHOENIX-4666 at 4/13/18 9:10 PM:
---
Ah! got it, thanks

> Add a subquery cache that persists beyond the life of a query
> --------------------------------------------------------------
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Marcell Ortutay
> Assignee: Marcell Ortutay
> Priority: Major
>
> The user list thread for additional context is here:
> https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E
>
> A Phoenix query may contain expensive subqueries, and moreover those expensive subqueries may be used across multiple different queries. While whole-result caching is possible at the application level, it is not possible to cache subresults in the application. This can cause bad performance for queries in which the subquery is the most expensive part of the query, and the application is powerless to do anything at the query level. It would be good if Phoenix provided a way to cache subquery results, as it would provide a significant performance gain.
>
> An illustrative example:
>
> SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = {id}
>
> In this case, the subquery "expensive_result" is expensive to compute, but it doesn't change between queries. The rest of the query does, because of the {id} parameter. This means the application can't cache it, but it would be good if there were a way to cache expensive_result.
>
> Note that there is currently a coprocessor-based "server cache", but the data in this "cache" is not persisted across queries. It is deleted after a TTL expires (30 sec by default), or when the query completes.
>
> This issue is fairly high priority for us at 23andMe and we'd be happy to provide a patch with some guidance from Phoenix maintainers. We are currently putting together a design document for a solution, and we'll post it to this Jira ticket for review in a few days.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431546#comment-16431546 ] Marcell Ortutay edited comment on PHOENIX-4666 at 4/10/18 12:20 AM:
---
An update on this: I've implemented a basic version that re-uses the RHS results in a subquery cache. I made a few changes to the original hacky implementation that I wanted to get some feedback on. My code is here: https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache ; please note this is a work in progress. I've changed the following things:

1. In my first implementation, I stored a mapping of subquery hash -> ServerCache client side. This works in the single-client use case but doesn't work if you have a cluster of PQS servers (which is our situation at 23andMe). So instead I replaced this with an RPC mechanism: the client sends an RPC to each region server to check whether the subquery results are available.
2. Originally I planned to only return a boolean in the RPC check. However, I ran into an issue: it turns out that the serialize() method is involved in the generation of key ranges that are used in the query [1]. This serialize() method is in the addHashCache() code path. To make sure this code is hit, I am creating a CachedSubqueryResultIterator which is passed to the addHashCache() code path. This ensures that all side effects, like the key range generation, are the same between the cached and uncached code paths. Would love to get feedback on this approach.

For (2) there is an alternate approach that also caches the key ranges. This is more efficient but has the downside of needing specialized code. Work still left to do is eviction logic, a hint to enable the feature, and general cleanup/testing.

[1] https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131
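The deterministic subquery-hash-to-cache-ID idea described above can be sketched roughly as follows. This is a minimal illustration, not Phoenix's actual HashCacheClient code; the class name and the assumption that the subquery SQL text is already normalized are both hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SubqueryCacheId {
    // Derive a deterministic cache ID from the subquery's SQL text, so that
    // any client (or PQS server) issuing the same subquery computes the same
    // ID, with no client-side mapping required.
    public static byte[] fromSubquery(String normalizedSql) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(normalizedSql.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Because the ID is a pure function of the query text, a cluster of PQS servers needs no shared state to agree on it; only the presence check against the region servers needs the RPC.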
[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221 ] Marcell Ortutay edited comment on PHOENIX-4666 at 3/23/18 11:01 PM:
---
Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly simple, and it can be expanded in follow-up patches. To start, here is what I would propose:

1. Use the existing server cache, with an option to keep data around past a single query. There would be a new "keep around TTL" that sets the max time an entry is kept around. The inter-query data may be evicted if space is needed. Data being used for a "live" query is tracked as such, and is never evicted (keeping the current exception behavior).
2. The subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only activated if this hint is present. The hint also takes an optional cache key suffix, e.g. /*+ SUBQUERY_CACHE('2018-03-23') */, which the application can use to explicitly expire a cache in case the TTL does not give enough control.
3. Cache eviction uses some sort of priority queue / LRU type system. A simple ranking could be: Rank = # of cache hits in the last X minutes / size of the entry.

Things that will be left for future work:

1. Additional config/control around when to use the subquery cache, e.g. a global control, a table-level control, or table-timestamp-based controls
2. Use of Apache Arrow for serialization (instead of the existing HashCacheClient.serialize() method)
3. A persistent cache separate from the HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by end of the week for initial review.
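The eviction ranking proposed in point 3 above (hits in a recent window divided by entry size) can be sketched as below. CacheEntry and EvictionPolicy are hypothetical stand-ins for whatever bookkeeping the server cache actually keeps; this only illustrates the ranking itself.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for per-entry cache statistics.
class CacheEntry {
    final String id;
    final long recentHits;  // cache hits in the trailing X-minute window
    final long sizeBytes;   // serialized size of the cached RHS results

    CacheEntry(String id, long recentHits, long sizeBytes) {
        this.id = id;
        this.recentHits = recentHits;
        this.sizeBytes = sizeBytes;
    }

    // Rank = # of cache hits in the last X minutes / size of the entry.
    double rank() {
        return (double) recentHits / sizeBytes;
    }
}

class EvictionPolicy {
    // When space is needed, evict the lowest-ranked entry first: large,
    // rarely-hit entries go before small, frequently-hit ones.
    static CacheEntry nextToEvict(List<CacheEntry> entries) {
        return entries.stream()
                .min(Comparator.comparingDouble(CacheEntry::rank))
                .orElse(null);
    }
}
```

Entries pinned by a "live" query would simply be excluded from the candidate list before ranking, preserving the never-evict-in-use behavior described above.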
[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408984#comment-16408984 ] James Taylor edited comment on PHOENIX-4666 at 3/22/18 3:20 AM:
---
I like the simplicity of your design, [~ortutay], in using the hash as the cache ID. You could just assume that the cache is already available and react to the exception you get back by generating the cache if it's not. That way you'd need no mapping at all (and no central place to check whether the cache ID maps to an existing cache). That flow is already there, but you'd need to add the logic to generate the cache, as the current code assumes that it has already built the cache (i.e. this handles the situation in which a region splits and the new RS doesn't have the cache yet). Some considerations:
- at a minimum, we could have a global config for the TTL of the cache when this feature is enabled (so that it'd be a different config than the standard TTL config).
- at the finest granularity, you could even create a new hint that specifies the TTL, so you could specify it per query.
- we'd want to make it clear in documentation that the cache data would be stale once generated (until the TTL expires it).
- might consider having a new table-level property on which this feature could be enabled (or a table-specific TTL could be specified)
- might consider in the future using a format like Apache Arrow to represent the hash join cache data
- might consider off-heap memory for the hash join cache
- a persistent cache could be future work (or you could put interfaces in place that could be replaced)
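The optimistic flow suggested above (assume the cache exists, react to the miss exception by building it) can be sketched like this. CacheMissException, SubqueryCache, runJoin and buildCache are hypothetical stand-ins for Phoenix's actual exception type and HashCacheClient calls; this only illustrates the control flow.

```java
// Thrown by a region server when the requested cache ID is not present.
class CacheMissException extends Exception {}

interface SubqueryCache {
    String runJoin(byte[] cacheId) throws CacheMissException;
    void buildCache(byte[] cacheId);
}

class OptimisticJoinRunner {
    static String run(SubqueryCache cache, byte[] cacheId) {
        try {
            // Optimistically run the join assuming the cache already exists
            // on the region servers; no client-side mapping is consulted.
            return cache.runJoin(cacheId);
        } catch (CacheMissException e) {
            // Cache miss: evaluate the subquery, ship the results, then retry.
            cache.buildCache(cacheId);
            try {
                return cache.runJoin(cacheId);
            } catch (CacheMissException e2) {
                throw new IllegalStateException("cache still missing after rebuild", e2);
            }
        }
    }
}
```

This is essentially the same retry path Phoenix already uses when a region split leaves a new region server without the hash cache, which is why the flow is "already there" and only the cache-generation step needs adding.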
[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query
[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408720#comment-16408720 ] Marcell Ortutay edited comment on PHOENIX-4666 at 3/21/18 10:46 PM:
---
Thanks for the input [~maryannxue]. My current implementation is here: https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1 and in particular this is my hacky patch: https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af . (Implemented on 4.7 for unrelated reasons, but I think the same idea is transferable to HEAD.) Instead of a random cache ID, it takes a hash of the query statement and uses that as the cache ID. Each Phoenix client maintains its own memory of which cache IDs have already been executed (this is not ideal, but it was easy to implement this way). If I'm understanding your proposal, the Phoenix client would attempt to use a cache ID with the expectation that it exists on the region servers. The region server would throw an exception if the cache ID is not found, which indicates to the Phoenix client that it should evaluate the subquery as usual.