[ 
https://issues.apache.org/jira/browse/IMPALA-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720802#comment-16720802
 ] 

Paul Rogers commented on IMPALA-7534:
-------------------------------------

As it turns out, it is not the Guava cache that has the race condition, it is 
our code. We do not use the "loading" feature of the cache. Instead, we seem to 
use the unprotected get/check/put pattern. The correct use case is illustrated 
in the [Drill 
project}https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/compile/CodeCompiler.java#L145]
 where the cache has provided rock-solid highly multi-threaded service as 
Drill's generated code cache.

To avoid the race condition, we need to use a loader. See the [Guava 
docs|https://github.com/google/guava/wiki/CachesExplained#from-a-cacheloader] 
for details.

Without a loader, there is no way for the Guava cache to ensure concurrency 
control since the get/check/put operations occur in our code, not Guava's. With 
a cache loader, Guava handles locking: the first reader loads the value, any 
concurrent readers wait for the first reader to complete.

Error handling is built-in: if the first reader encounters an error, the slot 
is not populated. The second reader tries again.

Likely we have our own concurrency control outside of the Guava cache. Perhaps 
we can eliminate that, if we have global lock, and rely on the Guava cache.

Looks like adding this to our code will be a non-trivial exercise. We'll want 
to provide detailed concurrency tests since, if we didn't notice the problem 
before now, we likely don't have these tests.

Will do more research and propose a concrete set of changes to make full use of 
the Guava loading cache, or provide an explanation why we can't.

> Handle invalidation races in CatalogdMetaProvider cache
> -------------------------------------------------------
>
>                 Key: IMPALA-7534
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7534
>             Project: IMPALA
>          Issue Type: Sub-task
>            Reporter: Todd Lipcon
>            Assignee: Paul Rogers
>            Priority: Major
>
> There is a well-known race in Guava's LoadingCache that we are using for 
> CatalogdMetaProvider which we are not currently handling:
> - thread 1 gets a cache miss and makes a request to fetch some data from the 
> catalogd. It fetches the catalog object with version 1 and then gets context 
> switched out or otherwise slow
> - thread 2 receives an invalidation for the same object, because it has 
> changed to v2. It calls 'invalidate' on the cache, but nothing is yet cached.
> - thread 1 puts back v1 of the object into the cache
> In essence we've "missed" an invalidation. This is also described in this 
> nice post: https://softwaremill.com/race-condition-cache-guava-caffeine/
> The race is quite unlikely but could cause some unexpected results that are 
> hard to reason about, so we should look into a fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

Reply via email to