[ https://issues.apache.org/jira/browse/PHOENIX-4666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437759#comment-16437759 ]

Marcell Ortutay commented on PHOENIX-4666:
------------------------------------------

Thanks for taking a look, [~maryannxue]. Responses below:

> 1. First of all, I think it's important that we have an option to enable and 
>disable the persistent cache, making sure that users can still run join 
>queries in the default temp-cache way.

Yes, definitely. In fact I am adding a hint, and for now I think it makes sense 
to enable the persistent cache only when that hint is present, so we don't 
break any existing behavior.

> 2. Regarding to your change [2], can you explain what exactly is the problem 
>of key-range generation? Looks like checkCache() and addCache() are doing 
>redundant work, and CachedSubqueryResultIterator should be unnecessary. We do 
>not wish to read the cache on the client side and then re-add the cache again.

Yes, in my first attempt at this I did not have the redundant work, but I ran 
into a bug where I was getting empty results when using the cached code path. 
If you look at HashCacheClient.addHashCache() you'll notice that it calls 
serialize(): 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L85]

serialize() produces the serialized RHS join cache and iterates over all the 
results in the ResultIterator for the RHS query. On this line 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/join/HashCacheClient.java#L131]
it adds entries to keyRangeRhsValues, which is a list of the key values on the 
RHS of the join. That list is used here 
[https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/execute/HashJoinPlan.java#L226]
in HashJoinPlan as part of the query, and at some point it becomes an argument 
used to set the scan range (I can dig up exactly where if you'd like).

For this reason the cached code path somehow needs to generate correct values 
for keyRangeRhsValues, or correct values for the scan range, and these values 
need to be available on the client side.

The approach I took simply re-runs the same code path for both no-cache and 
cached queries. The advantage is that it was fairly simple to implement and it 
guarantees identical execution; the downside is the redundant work. It would 
also be possible to add special-case code that sets the scan range for cached 
queries, which would be a bit harder to implement but more efficient.
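
To make the tradeoff concrete, here is a rough, self-contained sketch of the 
flow; the names below are illustrative stand-ins, not the actual Phoenix 
classes or the code in the patch:

    import java.util.ArrayList;
    import java.util.List;

    class RedundantWorkSketch {
        // Stand-in for HashCacheClient.serialize(): builds the hash cache payload
        // and, as a side effect, fills keyRangeRhsValues for the client.
        static List<byte[]> serializeRhs(Iterable<byte[]> rhsRows,
                                         List<byte[]> keyRangeRhsValues) {
            List<byte[]> payload = new ArrayList<>();
            for (byte[] row : rhsRows) {
                payload.add(row);            // serialized RHS join cache
                keyRangeRhsValues.add(row);  // key values later used for the scan range
            }
            return payload;
        }

        static void runJoin(Iterable<byte[]> rhsRows, boolean persistentCacheHit) {
            List<byte[]> keyRangeRhsValues = new ArrayList<>();
            if (persistentCacheHit) {
                // Cache hit: the RHS rows are read back from the persistent cache
                // on the client and pushed through serializeRhs() again, purely so
                // that keyRangeRhsValues gets populated. This is the redundant work.
                serializeRhs(rhsRows, keyRangeRhsValues);
            } else {
                // Cache miss: the normal path, which also ships the payload to the
                // region servers and persists it.
                serializeRhs(rhsRows, keyRangeRhsValues);
            }
            // Either way, keyRangeRhsValues is available to set the scan range.
        }
    }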

Happy to hear what people think about this. Maybe there is something much 
simpler that I am missing!

> 3. We need to be aware that the string representation of the sub-query 
>statement is not reliable, which means the same join-tables or sub-queries do 
>not necessarily map to the same string representation, and thus will have 
>different generated cache-id. It'd be optimal if we can have some 
>normalization here. We can consider leaving this as a future improvement, yet 
>at this point we'd better have some test cases (counter cases as well) to 
>cover this point.

Yes, definitely. I'd prefer to leave this as a future improvement to keep the 
initial PR focused. IIRC, for some complex queries there is a "$1" or "$2" 
placeholder, which changes even across identical queries. There are probably 
more cases like this, e.g. "x=10 AND y=20" is the same as "y=20 AND x=10".
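
To make this concrete, here is a small self-contained sketch; hashing the raw 
statement string with SHA-256 is just an illustration of what a naive 
string-based cache id amounts to, not the actual implementation:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    class CacheIdNormalizationSketch {
        // Naive string-based cache id: hash the subquery text as-is.
        static byte[] cacheId(String subquery) throws Exception {
            return MessageDigest.getInstance("SHA-256")
                    .digest(subquery.getBytes(StandardCharsets.UTF_8));
        }

        public static void main(String[] args) throws Exception {
            byte[] a = cacheId("SELECT id_1 FROM large_table WHERE x = 10 AND y = 20");
            byte[] b = cacheId("SELECT id_1 FROM large_table WHERE y = 20 AND x = 10");
            // Semantically identical subqueries, but without normalization they
            // get different cache ids, so the second one misses the cache.
            System.out.println(Arrays.equals(a, b)); // prints false
        }
    }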

> 4. Is there a way for us to update the cache content if tables have been 
>updated? This might be related to what approach we take to add and re-validate 
>cache in (2).

Currently, no. I was thinking, though, that the user application could control 
invalidation using a hint, like this:

    /*+ CACHE_PERSISTENT('2018-01-01 12:00') */

The '2018-01-01 12:00' would be a suffix to whatever cacheId we generate, like 
this:

    cacheId = hash(cacheId + '2018-01-01 12:00')

which lets the application explicitly invalidate the cache when needed.
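
For illustration, here is a minimal sketch of folding the hint value into the 
cache id; the helper below is hypothetical, not the actual patch code:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    class PersistentCacheIdSketch {
        // Hypothetical: mix the hinted invalidation token into the generated
        // cache id, so a new hint value forces a fresh cache entry.
        static byte[] cacheIdWithHint(byte[] baseCacheId, String hintValue) throws Exception {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            digest.update(baseCacheId);
            digest.update(hintValue.getBytes(StandardCharsets.UTF_8));
            return digest.digest();
        }
    }

Re-issuing the same query with, say, /*+ CACHE_PERSISTENT('2018-02-01 12:00') */ 
then produces a different cache id, and the old entry simply ages out.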

> 5. A rather minor point as it just occurred to me: Can we have CacheEntry 
>implement Closable?

Yes. Just so I know, what is the benefit of this?
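
If the benefit is deterministic cleanup via try-with-resources, I imagine 
something along these lines (sketch only; I'm assuming CacheEntry wraps some 
closeable chunk of cache memory):

    import java.io.Closeable;
    import java.io.IOException;

    class CacheEntry implements Closeable {
        private final Closeable chunk;  // stand-in for the cached payload

        CacheEntry(Closeable chunk) {
            this.chunk = chunk;
        }

        @Override
        public void close() throws IOException {
            chunk.close();  // release the memory as soon as the caller is done
        }
    }

    // Callers could then do:
    //     try (CacheEntry entry = cache.get(cacheId)) {
    //         // use the cached results
    //     }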

And yes, apologies for the messy code. I'm fixing it up today and it should be 
ready for a more thorough review today or tomorrow.

> Add a subquery cache that persists beyond the life of a query
> -------------------------------------------------------------
>
>                 Key: PHOENIX-4666
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4666
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Marcell Ortutay
>            Assignee: Marcell Ortutay
>            Priority: Major
>
> The user list thread for additional context is here: 
> [https://lists.apache.org/thread.html/e62a6f5d79bdf7cd238ea79aed8886816d21224d12b0f1fe9b6bb075@%3Cuser.phoenix.apache.org%3E]
> ----
> A Phoenix query may contain expensive subqueries, and moreover those 
> expensive subqueries may be used across multiple different queries. While 
> whole result caching is possible at the application level, it is not possible 
> to cache subresults in the application. This can cause bad performance for 
> queries in which the subquery is the most expensive part of the query, and 
> the application is powerless to do anything at the query level. It would be 
> good if Phoenix provided a way to cache subquery results, as it would provide 
> a significant performance gain.
> An illustrative example:
>     SELECT * FROM table1
>     JOIN (SELECT id_1 FROM large_table WHERE x = 10) expensive_result
>     ON table1.id_1 = expensive_result.id_2 AND table1.id_1 = {id}
> In this case, the subquery "expensive_result" is expensive to compute, but it 
> doesn't change between queries. The rest of the query does because of the 
> {id} parameter. This means the application can't cache it, but it would be 
> good if there was a way to cache expensive_result.
> Note that there is currently a coprocessor based "server cache", but the data 
> in this "cache" is not persisted across queries. It is deleted after a TTL 
> expires (30sec by default), or when the query completes.
> This issue is fairly high priority for us at 23andMe and we'd be happy to 
> provide a patch with some guidance from Phoenix maintainers. We are currently 
> putting together a design document for a solution, and we'll post it to this 
> Jira ticket for review in a few days.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
