[ 
https://issues.apache.org/jira/browse/PHOENIX-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16816752#comment-16816752
 ] 

John Phillips commented on PHOENIX-5239:
----------------------------------------

[~lhofhansl] in that scenario, the proposed behavior is that the subquery 
result would be sent to all 1000 regionservers to be cached. Note that this 
would apply only to persistently cached subqueries; regular subqueries would 
continue to send the cache to only the 10 relevant servers for the query.
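
As a rough illustration of that difference, here is a minimal, self-contained 
sketch of the server selection. It is not Phoenix code: {{CacheTargetSketch}}, 
{{Region}}, and {{targetServers}} are hypothetical names, and regions are 
modeled as plain inclusive integer ranges rather than byte-array row keys. Only 
the {{keyRangesIntersect || usePersistentCache}} condition mirrors the change 
proposed in the ticket.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CacheTargetSketch {

    /** Hypothetical stand-in for a region: its hosting server and an inclusive id_1 range. */
    static final class Region {
        final String server;
        final int startId;
        final int endId;

        Region(String server, int startId, int endId) {
            this.server = server;
            this.startId = startId;
            this.endId = endId;
        }

        /** True if the inclusive query range [lo, hi] overlaps this region's range. */
        boolean intersects(int lo, int hi) {
            return lo <= endId && hi >= startId;
        }
    }

    /** Picks the servers that would receive the subquery cache for one outer query. */
    static Set<String> targetServers(List<Region> regions, int queryLo, int queryHi,
                                     boolean usePersistentCache) {
        Set<String> servers = new LinkedHashSet<>();
        for (Region region : regions) {
            boolean keyRangesIntersect = region.intersects(queryLo, queryHi);
            // Assumed proposed rule: a persistent cache goes to every server,
            // a regular cache only to servers whose regions the query touches.
            if (keyRangesIntersect || usePersistentCache) {
                servers.add(region.server);
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        regions.add(new Region("RS1", 0, 9));
        regions.add(new Region("RS2", 10, 19));

        System.out.println(targetServers(regions, 5, 5, false)); // prints [RS1]
        System.out.println(targetServers(regions, 5, 5, true));  // prints [RS1, RS2]
    }
}
```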

Additionally, after more thought, I think another benefit of the proposed 
change would be preventing persistent cache inconsistencies. Using the example 
query in the ticket with {{table1}} and {{large_table}}, let's say the primary 
key of {{table1}} is {{id_1}}, and it's split into 2 regions:

* RegionServer1 has Region1 with {{id_1}} range {{[0-9]}}
* RegionServer2 has Region2 with {{id_1}} range {{[10-19]}}

Then, if the following queries ran serially (same example query, just 
substituting the where clause in the outer query), I believe the current 
code would produce this behavior:

* {{WHERE table1.id_1 = 5}} -> involves RS1 which doesn't contain the cache. 
Computes subquery of {{large_table}} and caches on RS1
* {{WHERE table1.id_1 = 6}} -> involves RS1 which now does have the cache. Uses 
cache to compute results on RS1
* Data is added, removed, or updated in {{large_table}}
* {{WHERE table1.id_1 = 12}} -> involves RS2 which doesn't contain the cache. 
Computes subquery of {{large_table}} and caches on RS2
* {{WHERE table1.id_1 IN (7, 17)}} -> involves RS1 and RS2 which both have the 
cache. The {{id_1 = 7}} result joins against the now out-of-date cache on RS1, 
and the {{id_1 = 17}} result joins against the up-to-date cache on RS2
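
The sequence above can be replayed with a small simulation. This is only a 
sketch under simplifying assumptions, not Phoenix code: {{StaleCacheSketch}} and 
its methods are hypothetical, the contents of {{large_table}} are reduced to a 
version counter, cache expiry is ignored, and the two regionservers are 
hard-coded to the ranges listed earlier. The {{sendToAllServers}} flag stands in 
for the proposed behavior.

```java
import java.util.HashMap;
import java.util.Map;

final class StaleCacheSketch {
    private static final String[] SERVERS = {"RS1", "RS2"};

    /** Hypothetical flag standing in for the proposed "send to all regionservers" behavior. */
    private final boolean sendToAllServers;
    /** Stand-in for the contents of large_table; bumped on every write. */
    private int tableVersion = 1;
    /** Per-server cache: server name -> large_table version the cached result was built from. */
    private final Map<String, Integer> cacheByServer = new HashMap<>();

    StaleCacheSketch(boolean sendToAllServers) {
        this.sendToAllServers = sendToAllServers;
    }

    /** Region layout from the example: RS1 holds id_1 0-9, RS2 holds id_1 10-19. */
    static String serverFor(int id1) {
        return id1 <= 9 ? "RS1" : "RS2";
    }

    void writeToLargeTable() {
        tableVersion++;
    }

    /** Returns the large_table version that the row with this id_1 is joined against. */
    int query(int id1) {
        String server = serverFor(id1);
        if (!cacheByServer.containsKey(server)) {
            // Cache miss: compute the subquery now and cache it on the target server(s).
            if (sendToAllServers) {
                for (String s : SERVERS) {
                    cacheByServer.putIfAbsent(s, tableVersion);
                }
            } else {
                cacheByServer.put(server, tableVersion);
            }
        }
        return cacheByServer.get(server);
    }

    /** Replays the serial queries and returns the versions seen by id_1 = 7 and id_1 = 17. */
    static int[] replay(boolean sendToAllServers) {
        StaleCacheSketch cluster = new StaleCacheSketch(sendToAllServers);
        cluster.query(5);            // miss on RS1
        cluster.query(6);            // hit on RS1
        cluster.writeToLargeTable(); // large_table changes
        cluster.query(12);           // miss on RS2 unless the cache was sent everywhere
        return new int[] {cluster.query(7), cluster.query(17)};
    }

    public static void main(String[] args) {
        int[] current = replay(false);
        int[] proposed = replay(true);
        System.out.println("current:  " + current[0] + " vs " + current[1]);   // prints current:  1 vs 2
        System.out.println("proposed: " + proposed[0] + " vs " + proposed[1]); // prints proposed: 1 vs 1
    }
}
```

In this model the current behavior joins the two rows against different 
versions of the subquery result, while sending the cache to every server keeps 
both rows on the same (possibly stale) version.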

While it's expected that using the subquery cache could return stale data, I 
would consider it very unexpected for it to return inconsistently stale data 
depending on the row.

> Send persistent subquery cache to all regionservers
> ---------------------------------------------------
>
>                 Key: PHOENIX-5239
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5239
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: John Phillips
>            Priority: Major
>
> PHOENIX-4666 introduced a persistent subquery cache that allowed phoenix to 
> cache the results from an expensive subquery (enabled with a 
> {{USE_PERSISTENT_CACHE}} query hint) to speed up subsequent queries.
> More context is available on the PHOENIX-4666 ticket, but a quick example 
> would be a query like:
> {code:java}
> SELECT /*+ USE_PERSISTENT_CACHE */ *
>     FROM table1
>     JOIN (SELECT id_2 FROM large_table WHERE x = 10) expensive_result
>     ON table1.id_1 = expensive_result.id_2
> WHERE table1.id_1 = [some_id]
> {code}
> Where lots of queries are run, differing only by {{some_id}}. Our usage 
> involves first running one query over phoenix to warm the cache (which takes 
> ~20 seconds), then once complete, allowing the live queries to run, which 
> utilize the persistent subquery cache (~100ms).
> However, we noticed that when phoenix sends the cache to the regionservers, 
> it looks at {{some_id}} in the outer query to figure out which regionservers 
> might contain {{table1.id_1 = [some_id]}} ([code 
> here|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]).
>  This means that when we first start running the query, we'll inconsistently 
> hit the cache until it ends up being propagated to all the regionservers.
> Basically, we'd like to have some way to warm the subquery cache and ensure 
> it's on all the regionservers so subsequent queries will always find the 
> cache. I think the simplest solution might be updating the [if statement in 
> ServerCacheClient#addServerCache|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]
>  to simply always send the cache to all the regionservers if it's a 
> persistent subquery:
> {code:java}
> - if ( ! servers.contains(entry) &&
> -         keyRanges.intersectRegion(regionStartKey, regionEndKey,
> -                 cacheUsingTable.getIndexType() == IndexType.LOCAL)) {
> + boolean keyRangesIntersect = keyRanges.intersectRegion(regionStartKey, regionEndKey,
> +         cacheUsingTable.getIndexType() == IndexType.LOCAL);
> + if (!servers.contains(entry) && (keyRangesIntersect || usePersistentCache)) {
> {code}
> I tested this out, and it seems to work as expected. If it sounds like an 
> acceptable solution, I'd be happy to make an actual PR. Or, if anyone has any 
> other suggestions on better ways to handle this, it would be much appreciated.
> FYI [~jamestaylor], [~elserj], and [~maryannxue] since it looks like you 
> three handled most of the review on the [original persistent cache 
> PR|https://github.com/apache/phoenix/pull/298]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
