[
https://issues.apache.org/jira/browse/IMPALA-7534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16866066#comment-16866066
]
Todd Lipcon commented on IMPALA-7534:
-------------------------------------
Reading back over Paul's analysis here, I think the missing link is that the
version-numbered cache keys are used for individual objects, but not the higher
levels in the hierarchy (like table name list and the top-level table object).
So, this can cause issues like IMPALA-8567 as described above. Assuming a
starting state where the table name list is not cached:
- Impalad: some select query, which calls loadTableNames(), and sends a request
to the catalog
- Catalog: returns a list of tables ['foo'], but the response is still in-flight
- Catalog: someone issues a DDL which creates a table 'bar'. Issues an
invalidate to all impalads
- Impalad: the loadTableNames() call is still in flight, but receives the
invalidation via a different thread. The invalidation sees nothing is in the
cache, so it is ignored.
- Impalad: the loadTableNames() query completes, and the table list ['foo'] is
cached
This leaves the impalad cache in a persistent incorrect state. New calls to
loadTableNames() get a cache hit with the incorrect value.
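
To make the interleaving concrete, here is a minimal single-threaded replay of
the steps above. It is only a sketch: the dict, the list, and the key
"table_names" are hypothetical stand-ins for the impalad cache, the catalog
state, and the table name list entry, and Python is used purely for brevity.

```python
# Hypothetical stand-ins: a dict plays the impalad cache, a list plays the
# catalog's table list. Neither is real Impala code.
catalog_tables = ["foo"]
cache = {}

# Impalad: cache miss, so loadTableNames() asks the catalog. The response is
# computed now but has not been stored in the cache yet ("in flight").
in_flight_response = list(catalog_tables)

# Catalog: a DDL creates 'bar' and an invalidation is sent to the impalad.
catalog_tables.append("bar")

# Impalad (other thread): the invalidation finds nothing cached, so it is a
# no-op.
invalidation_removed_entry = cache.pop("table_names", None) is not None

# Impalad: the delayed response finally arrives and gets cached.
cache["table_names"] = in_flight_response

print(invalidation_removed_entry)  # False
print(cache["table_names"])        # ['foo']
print(catalog_tables)              # ['foo', 'bar']
```

Nothing after the last step will evict the stale entry, which is why the
incorrect state persists until some later invalidation for the same key.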
In order to fix this, as discussed in the linked articles, we have a few
choices:
(1) invalidate can block on any outstanding "loadWithCaching" for the same key,
and invalidate it after it gets stored in the cache
(2) invalidate can prevent any outstanding "loadWithCaching" from writing back
its result
Choice 2 is better, since it avoids blocking between potentially unrelated
operations.
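
One possible shape for choice 2, as a rough sketch rather than a proposal for
the actual patch: keep a per-key invalidation counter, have the load remember
the counter value it started with, and skip the write-back if the counter has
moved. The class and method names below are hypothetical, and Python is used
only for brevity.

```python
import threading


class InvalidationAwareCache:
    """Sketch of a cache where invalidate() cancels in-flight write-backs."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}
        # key -> number of invalidations seen. Never pruned in this sketch.
        self._generations = {}

    def invalidate(self, key):
        with self._lock:
            self._generations[key] = self._generations.get(key, 0) + 1
            self._values.pop(key, None)

    def load_with_caching(self, key, loader):
        with self._lock:
            if key in self._values:
                return self._values[key]
            generation = self._generations.get(key, 0)
        value = loader()  # slow call, made without holding the lock
        with self._lock:
            # Store the result only if no invalidation arrived in the meantime.
            if self._generations.get(key, 0) == generation:
                self._values[key] = value
        return value

    def peek(self, key):
        with self._lock:
            return self._values.get(key)


# Replay of the race: the invalidation arrives while the load is in flight.
catalog_tables = ["foo"]
cache = InvalidationAwareCache()


def slow_loader():
    snapshot = list(catalog_tables)   # catalog answers with ['foo']
    catalog_tables.append("bar")      # DDL creates 'bar'
    cache.invalidate("table_names")   # invalidation reaches the impalad
    return snapshot


first = cache.load_with_caching("table_names", slow_loader)
cached_after_race = cache.peek("table_names")
second = cache.load_with_caching("table_names", lambda: list(catalog_tables))

print(first)              # ['foo']
print(cached_after_race)  # None
print(second)             # ['foo', 'bar']
```

The in-flight caller still sees ['foo'], which matches a query that started
before the DDL, but the stale list is never cached, and invalidate() never
waits on a loader.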
> Handle invalidation races in CatalogdMetaProvider cache
> -------------------------------------------------------
>
> Key: IMPALA-7534
> URL: https://issues.apache.org/jira/browse/IMPALA-7534
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Major
> Fix For: Not Applicable
>
>
> There is a well-known race in Guava's LoadingCache that we are using for
> CatalogdMetaProvider which we are not currently handling:
> - thread 1 gets a cache miss and makes a request to fetch some data from the
> catalogd. It fetches the catalog object with version 1 and then gets context
> switched out or is otherwise slow
> - thread 2 receives an invalidation for the same object, because it has
> changed to v2. It calls 'invalidate' on the cache, but nothing is yet cached.
> - thread 1 puts back v1 of the object into the cache
> In essence we've "missed" an invalidation. This is also described in this
> nice post: https://softwaremill.com/race-condition-cache-guava-caffeine/
> The race is quite unlikely but could cause some unexpected results that are
> hard to reason about, so we should look into a fix.