Marcell Ortutay commented on PHOENIX-4666:

A update on this: I've implemented a basic version of this that re-uses the RHS 
results in a subquery cache. I made a few changes to the original hacky 
implementation that I wanted to get some feedback on.

My code is here: 
[https://github.com/ortutay/phoenix/tree/PHOENIX-4666-subquery-cache] ; please 
note this is a work in progress.

I've changed the following things:
 # In my first implementation, I stored a mapping of subquery hash -> 
ServerCache client side. This works in the single client use case but doesn't 
work if you have a cluster of PQS servers (which is our situation at 23andMe). 
So instead I replaced this with an RPC mechanism. The client will send an RPC 
to each region server, and check if the subquery results are available.
 # Originally I planned to only return a boolean in the RPC check. However, I 
ran into an issue. It turns out that the serialize() method is involved in the 
generation of key ranges that are used in the query [1]. This serialize() 
method is in the addHashCache() code path. In order to make sure this code is 
hit, I am creating a CachedSubqueryResultIterator which is passed to the 
addHashCache() code path. This ensures that all side effects, like the key 
range generation, are the same between cached / uncached code paths.

Would love to get feedback on this approach. For (2) there is an alternate 
approach that also caches the key ranges. This is more efficient but has the 
downside of needing specialized code.

Work still left to do is eviction logic, and hint to enable, and general 


> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>                 Key: PHOENIX-4666
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Marcell Ortutay
>            Assignee: Marcell Ortutay
>            Priority: Major
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This is issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.

This message was sent by Atlassian JIRA

Reply via email to