[
https://issues.apache.org/jira/browse/PHOENIX-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832705#comment-16832705
]
Josh Elser commented on PHOENIX-5239:
-------------------------------------
{quote}The problem is that leads to unpredictability of response time.
{quote}
Gotcha. So, your persistent query is less effective because you still have
variance in 'deploying' the results to the RS.
{quote}
I would see this as the lesser of two evils when it's anticipated the subquery
will end up being used on most of the regionservers. What's your opinion of
adding a config option to toggle this behavior?
{quote}
I think I lean towards Lars' opinions as well – I don't like it, but I won't
veto the change. A hint which indicates "cache on all servers" (instead of
"cache on necessary servers") strikes me as the best middle-ground. People have
to opt-in to it, but it's still usable for you without much pain. However, I
would caution that this does _not_ work for multi-tenant installations. If you
and I are each trying to cache a query on all RS which fill the available
cache, we'll just stomp on each other.
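To make the two hint modes concrete, here is a rough, self-contained sketch of the selection logic. It is not the actual ServerCacheClient code: the `CacheScope` enum, the `Region` class, and the integer keys are hypothetical stand-ins I made up to show the difference between "cache on necessary servers" and "cache on all servers".

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CacheTargetSketch {
    /** Hypothetical hint modes; these names are not part of Phoenix. */
    enum CacheScope { NECESSARY_SERVERS, ALL_SERVERS }

    /** Simplified stand-in for a region: a half-open key range [startKey, endKey) on one server. */
    static final class Region {
        final String server;
        final int startKey;
        final int endKey;

        Region(String server, int startKey, int endKey) {
            this.server = server;
            this.startKey = startKey;
            this.endKey = endKey;
        }

        boolean contains(int key) {
            return key >= startKey && key < endKey;
        }
    }

    /** Returns the servers that would receive the cache for a single-key lookup. */
    static Set<String> selectServers(List<Region> regions, int lookupKey, CacheScope scope) {
        Set<String> servers = new LinkedHashSet<>();
        for (Region region : regions) {
            boolean keyRangesIntersect = region.contains(lookupKey);
            // Opt-in mode: send to every server, whether or not the key range intersects.
            if (keyRangesIntersect || scope == CacheScope.ALL_SERVERS) {
                servers.add(region.server); // the Set drops duplicate servers
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        List<Region> regions = List.of(
            new Region("rs1", 0, 100),
            new Region("rs2", 100, 200),
            new Region("rs3", 200, 300));
        System.out.println(selectServers(regions, 150, CacheScope.NECESSARY_SERVERS));
        System.out.println(selectServers(regions, 150, CacheScope.ALL_SERVERS));
    }
}
```

With the default scope only the server hosting the matching region gets the cache; with the opt-in scope every server does, which is exactly where two tenants could fill each other's cache space.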
I believe my bigger fear is that this is just one of several features in which you
try to make RegionServers look like a distributed memory caching layer.
RegionServers are definitely not built for such a thing (Java isn't great at
keeping large hunks of memory resident). If this is something long-term you
want to use, we may be better off trying to use a Redis or Memcached instead of
keeping it in the RegionServer. Having a distributed filesystem behind HBase is
something we can use, although, for large numbers of RegionServers, we might
find ourselves in a case where we have a thundering-herd going to a single DataNode
(when the blocks aren't yet replicated to multiple DataNodes). In short, I
think there's likely a better, long-term architecture choice to find, but it
would require some experimentation to see what would be best (in both
simplicity and efficiency) :). Good follow-on thoughts.
> Send persistent subquery cache to all regionservers
> ---------------------------------------------------
>
> Key: PHOENIX-5239
> URL: https://issues.apache.org/jira/browse/PHOENIX-5239
> Project: Phoenix
> Issue Type: Improvement
> Reporter: John Phillips
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> PHOENIX-4666 introduced a persistent subquery cache that allowed phoenix to
> cache the results from an expensive subquery (enabled with a
> {{USE_PERSISTENT_CACHE}} query hint) to speed up subsequent queries.
> More context is available on the PHOENIX-4666 ticket, but a quick example
> would be a query like:
> {code:java}
> SELECT /*+ USE_PERSISTENT_CACHE */ *
> FROM table1
> JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result
> ON table1.id_1 = expensive_result.id_1
> WHERE table1.id_1 = [some_id]
> {code}
> Where lots of queries are run, differing only by {{some_id}}. Our usage
> involves first running one query over phoenix to warm the cache (which takes
> ~20 seconds), then once complete, allowing the live queries to run, which
> utilize the persistent subquery cache (~100ms).
> However, we noticed that when phoenix sends the cache to the regionservers,
> it looks at {{some_id}} in the outer query to figure out which regionservers
> might contain {{table1.id_1 = [some_id]}} ([code
> here|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]).
> This means that when we first start running the query, we'll inconsistently
> hit the cache until it ends up being propagated to all the regionservers.
> Basically, we'd like to have some way to warm the subquery cache and ensure
> it's on all the regionservers so subsequent queries will always find the
> cache. I think the simplest solution might be updating the [if statement in
> ServerCacheClient#addServerCache|https://github.com/apache/phoenix/blob/2084a6c/phoenix-core/src/main/java/org/apache/phoenix/cache/ServerCacheClient.java#L282-L283]
> to simply always send the cache to all the regionservers if it's a
> persistent subquery:
> {code:java}
> - if ( ! servers.contains(entry) &&
> - keyRanges.intersectRegion(regionStartKey, regionEndKey,
> - cacheUsingTable.getIndexType() == IndexType.LOCAL)) {
> + boolean keyRangesIntersect = keyRanges.intersectRegion(regionStartKey, regionEndKey,
> + cacheUsingTable.getIndexType() == IndexType.LOCAL);
> + if (!servers.contains(entry) && (keyRangesIntersect || usePersistentCache)) {
> {code}
> I tested this out, and it seems to work as expected. If it sounds like an
> acceptable solution, I'd be happy to make an actual PR. Or, if anyone has any
> other suggestions on better ways to handle this, it would be much appreciated.
> FYI [~jamestaylor], [~elserj], and [~maryannxue] since it looks like you
> three handled most of the review on the [original persistent cache
> PR|https://github.com/apache/phoenix/pull/298]