nozjkoitop opened a new pull request, #19548:
URL: https://github.com/apache/druid/pull/19548

   ### Description
   
   #### Fixed a race condition in off-heap cached lookups
   
   This PR fixes a race between lookup cache refresh and concurrent query 
execution for cached-global lookups, using off-heap mmap-backed storage.
   
   Previously, query paths could obtain a `LookupExtractor` backed by the 
current cache version while a lookup refresh concurrently replaced and deleted 
that cache version. In off-heap mode, this could leave in-flight query work 
referencing disposed mmap-backed data.
   
   This PR adds lookup cache retirement and retained lookup extractors. When a 
cache refresh replaces a lookup cache version, the old version is retired 
instead of closed immediately. Retired cache versions remain alive while 
retained references exist, and are disposed once those references are released 
or after a configured timeout.
   
   #### Added retained lookup extractor lifecycle support
   
   `LookupExtractorFactory` now has an optional 
`acquireRetainedLookupExtractor()` method. Factories that support retained 
backing resources can return a `RetainedLookupExtractor`, which wraps the 
actual `LookupExtractor` and closes the retained backing reference when query 
work is finished.
   
   `RetainedLookupExtractor` supports explicit `close()` for query paths with a 
lifecycle hook, and also uses a `Cleaner` fallback for paths that hand lookup 
extractors to APIs without an explicit close hook.
   
   Retained lookup extractors are now used across lookup query paths, including:
   
   - registered lookup extraction functions
   - lookup dimension specs
   - lookup joins
   - lookup segment iteration
   
   #### Added cache retirement to cached-global lookups
   
   `CacheScheduler.VersionedCache` now tracks cache lifecycle states: live, 
retired, and disposed. Cache versions can be retained by active query work. 
When a refresh swaps in a new cache version, the previous version is retired 
and tracked until it can be safely disposed.
   
   To avoid unbounded retention, refreshes are skipped if the number of 
retained retired cache versions reaches the configured limit. Retired cache 
entries may also be forcibly disposed after a timeout to prevent abandoned 
references from blocking future refreshes indefinitely.
   
   Two new cached-global lookup configuration properties were added:
   
   | Property | Description | Default |
   |---|---|---:|
   | `druid.lookup.namespace.maxRetiredCacheEntries` | The maximum number of 
retired lookup cache versions to keep while they are still retained by 
in-flight query work. Further cache refreshes are skipped until the retired 
cache versions are released or time out. | `1` |
   | `druid.lookup.namespace.retiredCacheEntryTimeoutMillis` | The amount of 
time after retirement before a retained cache version may be disposed even if 
its retained references have not been closed yet. This prevents abandoned 
retained references from blocking future cache refreshes indefinitely. | 
`900,000` |
   
   #### Release note
   
   Cached-global lookups now retain retired cache versions while in-flight 
queries are still using them, preventing queries from referencing disposed 
off-heap lookup cache data during concurrent lookup refreshes.
   
   Two new configuration properties control this behavior: 
`druid.lookup.namespace.maxRetiredCacheEntries` and 
`druid.lookup.namespace.retiredCacheEntryTimeoutMillis`.
   
   <hr>
   
   ##### Key changed/added classes in this PR
   
    * `RetainedLookupExtractor`
    * `LookupExtractorFactory`
    * `NamespaceLookupExtractorFactory`
    * `CacheScheduler`
    * `NamespaceExtractionConfig`
    * `LookupDimensionSpec`
    * `RegisteredLookupExtractionFn`
    * `LookupSegment`
    * `LookupJoinable`
    * `LookupJoinableFactory`
   
   <hr>
   
   This PR has:
   
   - [x] been self-reviewed.
      - [x] using the [concurrency 
checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md)
 (Remove this item if the PR doesn't have any relation to concurrency.)
   - [x] added documentation for new or modified features or behaviors.
   - [x] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to