smengcl opened a new pull request, #4567: URL: https://github.com/apache/ozone/pull/4567
## What changes were proposed in this pull request? This is approach 1 of 2 that could fix the issue that bothers SnapDiff, where the current `LoadingCache` behaves like a simple LRU cache. We had no control over when an `OmSnapshot` instance can be evicted and closed, which can cause the the snapshot DB instance to be **closed prematurely** while SnapDiff is still running in the background, crashing the OM. Opening this as a draft for suggestions. @GeorgeJahad @hemantk-12 could also take a look at this. I'd like to give some opinions and feedbacks on the overall approach. The core of this PR is `SnapshotCache` class and `RefCount` interface. One can start from there. `SnapshotCache` is supposed to be thread-safe. With this approach, every single time an `OmSnapshot` instance is retrieved using `snapshotCache.get()`, a corresponding `snapshotCache.release()` has to be called at the end of its usage, otherwise it causes leakage because the reference count is not decremented. I have identified all existing usages of `snapshotCache.get()`, wrapped in another helper method or not, on latest master branch and added the logic (or TODO) there on how they should be properly handled. One can very, very easily forget to release the handle (as we have seen with all the RDB handle leaks Duong has been painstakingly dealing with over the past few months), so I don't think this approach is the cleanest. However, putting the decrement ref count logic in `OmSnapshot#close` (so that we can wrap `OmSnapshot` inside a try-with-resources) would also break the encapsulation thus considered hacky by myself. I put sanity checks everywhere just to be defensive. In the meantime, I will explore if [`CacheBuilder#weakValues()`](https://www.javadoc.io/doc/com.google.guava/guava/latest/com/google/common/cache/CacheBuilder.html#weakValues()) would be a viable alternative (potential approach 2), where we could probably rely on JVM itself to do the "reference counting" on the instances. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-7935 ## How was this patch tested? - [ ] Unit test to be finished. - [ ] Tests that covers existing snapshot cache use-cases, that is all snapshot operations that read or write snapshot DB, should all pass. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
