[ 
https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16412221#comment-16412221
 ] 

Marcell Ortutay commented on PHOENIX-4666:
------------------------------------------

Thanks for the input [~jamestaylor]. I'm thinking the first pass can be fairly 
simple, and it can be expanded in follow-up patches. To start, here is what I 
would propose:
 # Use existing server cache, with option to keep around data past a single 
query. There would be a new "keep around TTL" that sets the max time an entry 
is kept around. The inter-query data may be evicted if space is needed. Data 
being used for a "live" query is tracked as such, and is never evicted (keep 
current Exception behavior).
 # Subquery cache is triggered with a /*+ SUBQUERY_CACHE */ hint, and is only 
activated if this hint is present. This hint also has an optional cache key 
suffix, e.g. /*+ SUBQUERY_CACHE('2018-03-23') */, which can be used by the 
application to explicitly expire a cache entry, in case the TTL does not give 
enough control.
 # Cache eviction uses some sort of priority queue / LRU-type system. A simple 
ranking could be: rank = (number of cache hits in the last X minutes) / (size of the entry)
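
To make the three points above a bit more concrete, here is a rough sketch of how the keep-around TTL, the "live" pinning, and the rank-based eviction could fit together. It is only an illustration: the class, fields, and method names are all hypothetical and do not exist in Phoenix, timestamps are passed in explicitly to keep it self-contained, and IllegalStateException stands in for whatever exception the server cache currently throws when it runs out of space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these class, field, or method names exist in Phoenix.
class SubqueryCacheSketch {
    static final class Entry {
        final long sizeBytes;                              // assumed to be > 0
        final long expiresAtMs;                            // insert time + keep-around TTL
        final List<Long> hitTimesMs = new ArrayList<>();   // one timestamp per cache hit
        int liveQueries;                                   // > 0 while a running query uses the entry

        Entry(long sizeBytes, long expiresAtMs) {
            this.sizeBytes = sizeBytes;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long maxBytes, keepAroundTtlMs, hitWindowMs;
    private long usedBytes;

    SubqueryCacheSketch(long maxBytes, long keepAroundTtlMs, long hitWindowMs) {
        this.maxBytes = maxBytes;
        this.keepAroundTtlMs = keepAroundTtlMs;
        this.hitWindowMs = hitWindowMs;
    }

    // Rank = number of hits in the last hitWindowMs / size of the entry.
    double rank(Entry e, long nowMs) {
        long recentHits = e.hitTimesMs.stream().filter(t -> t > nowMs - hitWindowMs).count();
        return (double) recentHits / e.sizeBytes;
    }

    // Adds an entry, evicting idle entries (lowest rank first) until it fits.
    // Note: idle entries may already have been evicted by the time this throws.
    void put(String key, long sizeBytes, long nowMs) {
        Entry old = entries.remove(key);
        if (old != null) usedBytes -= old.sizeBytes;
        dropExpired(nowMs);
        while (usedBytes + sizeBytes > maxBytes) {
            String victim = lowestRankedIdleKey(nowMs);
            if (victim == null) {
                // Only live entries remain; they are never evicted.
                throw new IllegalStateException("no evictable space in subquery cache");
            }
            usedBytes -= entries.remove(victim).sizeBytes;
        }
        entries.put(key, new Entry(sizeBytes, nowMs + keepAroundTtlMs));
        usedBytes += sizeBytes;
    }

    // Called when a query starts using an entry: records a hit and pins the entry.
    boolean acquire(String key, long nowMs) {
        dropExpired(nowMs);
        Entry e = entries.get(key);
        if (e == null) return false;
        e.hitTimesMs.add(nowMs);
        e.liveQueries++;
        return true;
    }

    // Called when the query completes: unpins the entry but keeps it cached.
    void release(String key) {
        Entry e = entries.get(key);
        if (e != null && e.liveQueries > 0) e.liveQueries--;
    }

    boolean contains(String key) {
        return entries.containsKey(key);
    }

    // Removes idle entries whose keep-around TTL has passed.
    private void dropExpired(long nowMs) {
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (e.liveQueries == 0 && e.expiresAtMs <= nowMs) {
                usedBytes -= e.sizeBytes;
                it.remove();
            }
        }
    }

    // Returns the key of the idle entry with the lowest rank, or null if none is idle.
    private String lowestRankedIdleKey(long nowMs) {
        String victim = null;
        double lowest = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Entry> me : entries.entrySet()) {
            Entry e = me.getValue();
            if (e.liveQueries > 0) continue;
            double r = rank(e, nowMs);
            if (victim == null || r < lowest) {
                victim = me.getKey();
                lowest = r;
            }
        }
        return victim;
    }
}
```

A linear scan for the eviction victim is used here only to keep the sketch short; a real implementation would presumably use a priority queue or similar, and the cache key would need to incorporate the optional hint suffix.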

Things that will be left for future work:
 # Additional config/control around when to use the subquery cache, e.g. a global 
control, a table-level control, or table-timestamp-based controls
 # Use of Apache Arrow for serialization (instead of existing 
HashCacheClient.serialize() method)
 # Persistent cache separate from HBase coprocessor system

I'm going to start work on this next week, and hopefully will have a patch 
ready for initial review by the end of that week.

> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4666
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Marcell Ortutay
>            Assignee: Marcell Ortutay
>            Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1 JOIN (SELECT id_1 FROM large_table WHERE x = 10) 
> expensive_result ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = 
> \{id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> \{id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
