[
https://issues.apache.org/jira/browse/PHOENIX-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831771#comment-16831771
]
John Phillips commented on PHOENIX-5239:
----------------------------------------
[~lhofhansl] Thanks for getting back. I'm definitely open to updating the code
to add a flag, but based on your reply, I first want to make sure we're on the
same page, since I realize I may not have explained certain aspects of this
change well.
bq. we need to consider the general case
The PR as written does not change the behavior of the "general case". This
_only_ applies to {{SELECT}} queries where the "{{/*+ USE_PERSISTENT_CACHE
*/}}" hint is explicitly included. Without that hint, the behavior of
subqueries would be unchanged (they will only be distributed to the
regionservers that are strictly required for the query).
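To make the scope concrete, here is a minimal sketch of the selection rule the PR is aiming for. It is not the actual {{ServerCacheClient}} code: the {{Region}} class, its {{intersectsQueryKeys}} field, and {{selectServers}} are hypothetical stand-ins for the region metadata and the key-range intersection check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the actual ServerCacheClient code. */
public class CacheTargetSketch {

    /** Hypothetical stand-in for a region: its hosting server and whether the query's key ranges touch it. */
    public static final class Region {
        final String server;
        final boolean intersectsQueryKeys;

        public Region(String server, boolean intersectsQueryKeys) {
            this.server = server;
            this.intersectsQueryKeys = intersectsQueryKeys;
        }
    }

    /**
     * Returns the servers that should receive the subquery cache.
     * Without the hint, only servers hosting an intersecting region are chosen.
     * With the hint, every server is chosen.
     */
    public static List<String> selectServers(List<Region> regions, boolean usePersistentCache) {
        List<String> servers = new ArrayList<>();
        for (Region region : regions) {
            boolean wanted = region.intersectsQueryKeys || usePersistentCache;
            if (wanted && !servers.contains(region.server)) {
                servers.add(region.server);
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        List<Region> regions = Arrays.asList(
            new Region("rs1", true),
            new Region("rs2", false),
            new Region("rs3", false));
        System.out.println(selectServers(regions, false)); // [rs1]
        System.out.println(selectServers(regions, true));  // [rs1, rs2, rs3]
    }
}
```

With {{usePersistentCache}} set to false the result is identical to what the intersection check alone would produce, which is the sense in which the general case is untouched.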
bq. We cannot default to shipping a subquery to 1000 or more servers
only because they may later be involved in a query
The use case of the {{USE_PERSISTENT_CACHE}} hint is when people plan on using
the subquery in later queries and can tolerate stale data. If someone is using
the hint for a bunch of one-off queries, that's a usage mistake on their end
which I'm not convinced we necessarily need to do anything to prevent. Someone
could just as easily write the outer query join condition on some non-leading
part of the primary key, which would have the same effect of distributing the
subquery to all the regionservers.
Last, the caching is done inside of tenant-specific caches
({{TenantCacheImpl}}), which should help isolate some of the memory impact in a
multi-tenant environment.
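As a rough illustration of why per-tenant caching limits the blast radius, the sketch below keeps a separate map and byte budget per tenant. This is a hypothetical model, not how {{TenantCacheImpl}} is implemented: the class name, the fixed {{maxBytesPerTenant}} budget, and the reject-on-overflow policy are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of tenant-scoped caches; not Phoenix's TenantCacheImpl. */
public class TenantCacheSketch {
    private final long maxBytesPerTenant; // assumed fixed budget per tenant
    private final Map<String, Map<String, byte[]>> cachesByTenant = new HashMap<>();
    private final Map<String, Long> bytesUsedByTenant = new HashMap<>();

    public TenantCacheSketch(long maxBytesPerTenant) {
        this.maxBytesPerTenant = maxBytesPerTenant;
    }

    /** Stores the entry unless it would push this tenant over its own budget. */
    public boolean put(String tenantId, String cacheId, byte[] value) {
        Map<String, byte[]> cache = cachesByTenant.computeIfAbsent(tenantId, t -> new HashMap<>());
        long used = bytesUsedByTenant.getOrDefault(tenantId, 0L);
        byte[] previous = cache.get(cacheId);
        long newUsed = used - (previous == null ? 0 : previous.length) + value.length;
        if (newUsed > maxBytesPerTenant) {
            return false; // only this tenant is affected; other tenants keep their budget
        }
        cache.put(cacheId, value);
        bytesUsedByTenant.put(tenantId, newUsed);
        return true;
    }

    /** Returns the cached bytes, or null if this tenant has no such entry. */
    public byte[] get(String tenantId, String cacheId) {
        Map<String, byte[]> cache = cachesByTenant.get(tenantId);
        return cache == null ? null : cache.get(cacheId);
    }
}
```

In this model a tenant that fills its budget with persistent subquery results cannot evict or crowd out another tenant's entries.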
But if all that makes sense, and you still feel there should be a flag for
this, I'll go ahead and update the PR.
> Send persistent subquery cache to all regionservers
> ---------------------------------------------------
>
> Key: PHOENIX-5239
> URL: https://issues.apache.org/jira/browse/PHOENIX-5239
> Project: Phoenix
> Issue Type: Improvement
> Reporter: John Phillips
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> PHOENIX-4666 introduced a persistent subquery cache that allowed phoenix to
> cache the results from an expensive subquery (enabled with a
> {{USE_PERSISTENT_CACHE}} query hint) to speed up subsequent queries.
> More context is available on the PHOENIX-4666 ticket, but a quick example
> would be a query like:
> {code:java}
> SELECT /*+ USE_PERSISTENT_CACHE */ *
> FROM table1
> JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result
> ON table1.id_1 = expensive_result.id_1
> WHERE table1.id_1 = [some_id]
> {code}
> Where lots of queries are run, differing only by {{some_id}}. Our usage
> involves first running one query over phoenix to warm the cache (which takes
> ~20 seconds), then, once that completes, allowing the live queries to run,
> which use the persistent subquery cache (~100ms).
> However, we noticed that when phoenix sends the cache to the regionservers,
> it looks at {{some_id}} in the outer query to figure out which regionservers
> might contain {{table1.id_1 = [some_id]}} ([code
> here|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]).
> This means that when we first start running the query, we'll inconsistently
> hit the cache until it ends up being propagated to all the regionservers.
> Basically, we'd like to have some way to warm the subquery cache and ensure
> it's on all the regionservers so subsequent queries will always find the
> cache. I think the simplest solution might be updating the [if statement in
> ServerCacheClient#addServerCache|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]
> to simply always send the cache to all the regionservers if it's a
> persistent subquery:
> {code:java}
> - if ( ! servers.contains(entry) &&
> - keyRanges.intersectRegion(regionStartKey, regionEndKey,
> - cacheUsingTable.getIndexType() == IndexType.LOCAL)) {
> + boolean keyRangesIntersect = keyRanges.intersectRegion(regionStartKey, regionEndKey,
> +     cacheUsingTable.getIndexType() == IndexType.LOCAL);
> + if (!servers.contains(entry) && (keyRangesIntersect || usePersistentCache)) {
> {code}
> I tested this out, and it seems to work as expected. If it sounds like an
> acceptable solution, I'd be happy to make an actual PR. Or, if anyone has any
> other suggestions on better ways to handle this, it would be much appreciated.
> FYI [~jamestaylor], [~elserj], and [~maryannxue] since it looks like you
> three handled most of the review on the [original persistent cache
> PR|https://github.com/apache/phoenix/pull/298]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)