[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-13 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16437921#comment-16437921
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 4/13/18 9:10 PM:
---

Ah! got it, thanks


was (Author: ortutay):
Ah! got it, thank

> Add a subquery cache that persists beyond the life of a query
> -
>
> Key: PHOENIX-4666
> URL: https://issues.apache.org/jira/browse/PHOENIX-4666
> Project: Phoenix
>  Issue Type: Improvement
>Reporter: Marcell Ortutay
>Assignee: Marcell Ortutay
>Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> 
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> {id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> {id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-04-09 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431546#comment-16431546
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 4/10/18 12:20 AM:


An update on this: I've implemented a basic version that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache on the client side. This works in the single-client case but 
doesn't work if you have a cluster of PQS servers (which is our situation at 
23andMe), so I replaced it with an RPC mechanism: the client sends an RPC to 
each region server to check whether the subquery results are available.
 # Originally I planned to return only a boolean from the RPC check, but I ran 
into an issue: it turns out that the serialize() method is involved in 
generating the key ranges used in the query [1], and this serialize() method is 
on the addHashCache() code path. To make sure this code is hit, I create a 
CachedSubqueryResultIterator which is passed to the addHashCache() code path. 
This ensures that side effects, like the key range generation, are the same 
between the cached and uncached code paths.
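The deterministic subquery-hash idea in item 1 can be sketched roughly as 
follows. This is a hypothetical illustration, not code from the linked branch; 
the class name, the normalization step, and the choice of SHA-256 are all my 
assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: derive a stable cache ID from the subquery SQL text,
// so every client (and every PQS node) computes the same ID for the same
// subquery, with no client-side mapping required.
public class SubqueryCacheId {
    public static byte[] idFor(String subquerySql) {
        try {
            // Normalize whitespace so trivially different formattings of the
            // same subquery share one cache entry.
            String normalized = subquerySql.trim().replaceAll("\\s+", " ");
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(normalized.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Because the ID is a pure function of the query text, any node in a PQS cluster 
can probe the region servers for it without coordinating with the node that 
originally populated the cache.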

I would love feedback on this approach. For (2) there is an alternate approach 
that also caches the key ranges; it is more efficient but has the downside of 
needing specialized code.

Work still left to do includes eviction logic, a hint to enable the cache, and 
general cleanup/testing.

[1] https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131


was (Author: ortutay):
A update on this: I've implemented a basic version of this that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache client side. This works in the single client use case but doesn't 
work if you have a cluster of PQS servers (which is our situation at 23andMe). 
So instead I replaced this with an RPC mechanism. The client will send an RPC 
to each region server, and check if the subquery results are available.
 # Originally I planned to only return a boolean in the RPC check. However, I 
ran into an issue. It turns out that the serialize() method is involved in the 
generation of key ranges that are used in the query [1]. This serialize() 
method is in the addHashCache() code path. In order to make sure this code is 
hit, I am creating a CachedSubqueryResultIterator which is passed to the 
addHashCache() code path. This ensures that all side effects, like the key 
range generation, are the same between cached / uncached code paths.

Would love to get feedback on this approach. For (2) there is an alternate 
approach that also caches the key ranges. This is more efficient but has the 
downside of needing specialized code.

Work still left to do is eviction logic, and hint to enable, and general 
cleanup/testing.

[1]https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-23 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412221#comment-16412221
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 3/23/18 11:01 PM:


Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use the existing server cache, with an option to keep data around past a 
single query. There would be a new "keep-around TTL" that sets the maximum 
time an entry is kept. Inter-query data may be evicted if space is needed; 
data being used by a "live" query is tracked as such and is never evicted 
(keeping the current exception behavior).
 # The subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is 
only activated if this hint is present. The hint also takes an optional cache 
key suffix, e.g. /*+ SUBQUERY_CACHE('2018-03-23') */, which the application 
can use to explicitly expire a cache in case the TTL does not give enough 
control.
 # Cache eviction uses some sort of priority queue / LRU-type system. A simple 
ranking could be Rank = # of cache hits in the last X minutes / size of the 
entry.
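The ranking in item 3 amounts to something like the sketch below. All names 
here are hypothetical, and the sliding hit window and the "never evict live 
entries" rule from item 1 are omitted for brevity:

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the proposed eviction ranking:
// rank = cache hits in the last X minutes / size of the entry,
// evicting the lowest-ranked entry first when space is needed.
public class CacheEvictionSketch {
    public static class Entry {
        public final String id;
        public final long recentHits;  // hits within the sliding window
        public final long sizeBytes;
        public Entry(String id, long recentHits, long sizeBytes) {
            this.id = id;
            this.recentHits = recentHits;
            this.sizeBytes = sizeBytes;
        }
        double rank() { return (double) recentHits / sizeBytes; }
    }

    // Returns the entry that would be evicted first (lowest rank), i.e. the
    // entry that is large relative to how often it is being hit.
    public static Entry evictionCandidate(List<Entry> entries) {
        return entries.stream()
                .min(Comparator.comparingDouble(Entry::rank))
                .orElseThrow(() -> new IllegalArgumentException("empty cache"));
    }
}
```

A large, rarely-hit entry ranks below a small, frequently-hit one, so space 
pressure reclaims the entries with the worst hit-per-byte payoff first.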

Things that will be left for future work:
 # Additional config/control around when to use the subquery cache, e.g. 
global control, table-level control, or table-timestamp-based controls
 # Use of Apache Arrow for serialization (instead of the existing 
HashCacheClient.serialize() method)
 # A persistent cache separate from the HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
the end of the week for initial review.


was (Author: ortutay):
Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is track as such, and is never evicted (keep 
current Exception behavior)
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, eg.: /*+ SUBQUERY_CACHE('2018-03-23') */ which can be used by the 
application to explicitly expire a cache, in case TTL does not give enough 
control
 # Cache eviction uses some sort of priority queue / LRU type system. Simple 
ranking could be Rank = # of Cache Hits in Last X minutes / Size of the Entry

Things that will be left for future work:
 # Additional config/control around when to use subquery cache, eg. global 
control, or a table level control, or table timestamp based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch by 
end of the week for initial review


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408984#comment-16408984
 ] 

James Taylor edited comment on PHOENIX-4666 at 3/22/18 3:20 AM:


I like the simplicity of your design, [~ortutay], in using the hash as the 
cache ID. You could just assume the cache is already available and react to 
the exception you get back by generating the cache if it's not. That way you'd 
need no mapping at all (and no central place to check whether the cache ID 
maps to an existing cache). That flow already exists, but you'd need to add 
logic to generate the cache, as the current code assumes it has already built 
the cache (i.e. this flow handles the situation in which a region splits and 
the new RS doesn't have the cache yet).
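The optimistic flow described above (use the cache first, rebuild only on a 
miss exception) might look roughly like this. CacheMissException and the 
functional-interface shapes are hypothetical stand-ins, not Phoenix's actual 
types:

```java
import java.util.function.Supplier;

// Hypothetical sketch of the optimistic flow: assume the server cache exists,
// and only build it (then retry once) when the server reports a miss.
public class OptimisticCacheFlow {
    public static class CacheMissException extends RuntimeException {}

    public static <T> T runWithCache(Supplier<T> queryUsingCache, Runnable buildCache) {
        try {
            return queryUsingCache.get();   // optimistic first attempt
        } catch (CacheMissException e) {
            buildCache.run();               // generate the cache on demand
            return queryUsingCache.get();   // retry now that the cache exists
        }
    }
}
```

The appeal of this shape is that the common case (cache warm) pays no 
coordination cost at all, and the cold case degrades to roughly what an 
uncached query would have done anyway.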

Some considerations:
- at a minimum, we could have a global config for the TTL of the cache when 
this feature is enabled (so that it'd be a different config than the standard 
TTL config).
- at the finest granularity, you could even create a new hint that specifies 
the TTL so you could specify it per query.
- we'd want to make it clear in documentation that the cache data would be 
stale once generated (until the TTL expires it).
- might consider having a new table-level property on which this feature could 
be enabled (or a table-specific TTL could be specified)
- might consider in the future using a format like Apache Arrow to represent 
the hash join cache data
- might consider off-heap memory for the hash join cache
- a persistent cache could be future work (or you could put interfaces in 
place that could be replaced)


was (Author: jamestaylor):
I like the simplicity of your design, [~ortutay], in using the hash as the 
cache ID. You could just assume that the cache is already available and react 
to the exception you get back by generating the cache if it's not. That way 
you'd need no mapping at all (and no central place to check if the cache ID 
maps to an existing cache). That flow is already there, but you'd need to add 
the logic to generate the cache as the current code assumes that it has already 
built the cache (i.e. this handles the situation in which a region splits and 
the new RS doesn't have the cache yet).

Some considerations:
- at a minimum, we could have a global config for the TTL of the cache when 
this feature is enabled (so that it'd be a different config than the standard 
TTL config).
- at the finest granularity, you could even create a new hint that specifies 
the TTL so you could specify it per query.
- we'd want to make it clear in documentation that the cache data would be 
stale once generated (until the TTL expires it).
- might consider having a new table level property on which this feature could 
be enabled (or a table-specific TTL could be specified)


[jira] [Comment Edited] (PHOENIX-4666) Add a subquery cache that persists beyond the life of a query

2018-03-21 Thread Marcell Ortutay (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16408720#comment-16408720
 ] 

Marcell Ortutay edited comment on PHOENIX-4666 at 3/21/18 10:46 PM:


Thanks for the input [~maryannxue]. My current implementation is here: 
[https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in 
particular this is my hacky patch: 
[https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af]
 . (Implemented on 4.7 for unrelated reasons, but I think the same idea is 
transferable to HEAD.) Instead of a random cache ID it takes a hash of the 
query statement and uses that as the cache ID. Each Phoenix client maintains 
its own memory of which cache IDs have already been executed (this is not 
ideal, but it was easy to implement this way).
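The per-client bookkeeping described above (each client remembering which 
cache IDs it has already populated) amounts to something like the following. 
Again a hypothetical sketch with made-up names, not the patch itself:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: each Phoenix client remembers which subquery cache IDs
// it has already sent to the region servers. As the comment notes, this is
// not ideal: in a cluster of clients, each node keeps its own disjoint view
// of this set, so other nodes will redundantly rebuild the cache.
public class ClientCacheRegistry {
    private final Set<String> populatedIds = ConcurrentHashMap.newKeySet();

    /** Returns true if this ID was not yet known, i.e. the cache still needs
     *  to be built before running the query. */
    public boolean markPopulated(String cacheId) {
        return populatedIds.add(cacheId);
    }

    public boolean isPopulated(String cacheId) {
        return populatedIds.contains(cacheId);
    }
}
```

This is exactly the mapping that the later RPC-check approach (and the 
assume-and-react-to-exception proposal) makes unnecessary.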

If I'm understanding your proposal, the Phoenix client would attempt to use a 
cache ID with the expectation that it exists on region servers. The region 
server would throw an exception if the cache ID is not found, which indicates 
to Phoenix client that it should evaluate the subquery as usual.


was (Author: ortutay):
Thanks for the input [~maryannxue]. My current implementation is here: 
[https://github.com/ortutay23andme/phoenix/tree/4.7.0-HBase-1.1] and in 
particular this is my hacky patch: 
[https://github.com/ortutay23andme/phoenix/commit/04c96f672eb4bcdccec27f124373be766f8dd5af]
 . Instead of a random cache ID it takes a hash of the query statement and uses 
that as the cache ID. Each Phoenix client maintains it's own memory of which 
cache IDs have already been executed (this is not ideal, but it was easy to 
implement this way).

If I'm understanding your proposal, the Phoenix client would attempt to use a 
cache ID with the expectation that it exists on region servers. The region 
server would throw an exception if the cache ID is not found, which indicates 
to Phoenix client that it should evaluate the subquery as usual.
