[ 
https://issues.apache.org/jira/browse/PHOENIX-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818484#comment-16818484
 ] 

John Phillips commented on PHOENIX-5239:
----------------------------------------

[~lhofhansl] Thanks for the feedback. I'm curious what you mean by "if there're 
many others this might be a problem"? Is your concern that potentially adding 
the caches to regionservers where it's not necessary will take up too much 
memory, that distributing it to all the regionservers will be a lot of IO, or 
something else?

Also, could you elaborate on:
{quote}
Despite what PHOENIX-4666 says that application could very well execute any 
inner query first and then produce an IN query with a semi-join
{quote}

Are you referring to doing some sort of sort of caching at the application 
layer along the lines of:
{code}
// do this once to cache the IDs
List<Id> expensiveToCalculateIds = query("SELECT id_2 FROM large_table WHERE x 
= 10");
// call queries like this many times
query("SELECT * FROM table1 WHERE id_1 IN ($expensiveToCalculateIds) AND …");
{code}
Because you are correct in that should work for the trivial example query. 
However, unfortunately I don't think this would be feasible with our production 
queries as they are significantly more complex. For one, the outer queries need 
to select + aggregate on multiple columns from the results of inner query which 
I believe would make a solution like this impossible. Additionally, some of the 
subqueries are quite large (a few million rows), and making repeated queries 
with an {{IN}} clause that large would likely add a lot of IO overhead.

While we're open to looking into other ideas to make these queries performant, 
for what it's worth, the persistent subquery caching has worked well for us so 
far (notwithstanding this issue).

> Send persistent subquery cache to all regionservers
> ---------------------------------------------------
>
>                 Key: PHOENIX-5239
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5239
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: John Phillips
>            Priority: Major
>
> PHOENIX-4666 introduced a persistent subquery cache that allowed phoenix to 
> cache the results from an expensive subquery (enabled with a 
> {{USE_PERSISTENT_CACHE}} query hint) to speed up subsequent queries.
> More context is available on the PHOENIX-4666 ticket, but a quick example 
> would be a query like:
> {code:java}
> SELECT /*+ USE_PERSISTENT_CACHE */ *
>     FROM table1
>     JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result
>     ON table1.id_1 = expensive_result.id_2
> WHERE table1.id_1 = [some_id]
> {code}
> Where lots of queries are ran, differing only by {{some_id}}. Our usage 
> involves first running one query over phoenix to warm the cache (which takes 
> ~20 seconds), then once complete, allowing the live query to run which 
> utilize the persistent subquery cache (~100ms).
> However, we noticed that when phoenix sends the cache to the regionservers, 
> it looks at {{some_id}} in the outer query to figure out which regionservers 
> might contain {{table1.id_1 = [some_id]}} ([code 
> here|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]).
>  This means that when we first start running the query, we'll inconsistently 
> hit the cache until it ends up being propagated to all the regionservers.
> Basically, we'd like to have some way to warm the subquery cache and ensure 
> it's on all the regionservers so subsequent queries will always find the 
> cache. I think the simplest solution might be updating the [if statement in 
> ServerCacheClient#addServerCache|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]
>  to simply always send the cache to all the regionservers if it's a 
> persistent subquery:
> {code:java}
> - if ( ! servers.contains(entry) &&
> -         keyRanges.intersectRegion(regionStartKey, regionEndKey,
> -                 cacheUsingTable.getIndexType() == IndexType.LOCAL)) {
> + boolean keyRangesIntersect = keyRanges.intersectRegion(regionStartKey, 
> regionEndKey,
> +         cacheUsingTable.getIndexType() == IndexType.LOCAL);
> + if (!servers.contains(entry) && (keyRangesIntersect || usePersistentCache)) 
> {
> {code}
> I tested this out, and it seems to work as expected. If it sounds like an 
> acceptable solution, I'd be happy to make an actual PR. Or, if anyone has any 
> other suggestions on better ways to handle this, it would be much appreciated.
> FYI [~jamestaylor], [~elserj], and [~maryannxue] since it looks like you 
> three handled most of the review on the [original persistent cache 
> PR|https://github.com/apache/phoenix/pull/298]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to