smengcl opened a new pull request, #4567:
URL: https://github.com/apache/ozone/pull/4567

   ## What changes were proposed in this pull request?
   
   This is approach 1 of 2 that could fix the issue that bothers SnapDiff, 
where the current `LoadingCache` behaves like a simple LRU cache. We had no 
control over when an `OmSnapshot` instance can be evicted and closed, which can 
cause the the snapshot DB instance to be **closed prematurely** while SnapDiff 
is still running in the background, crashing the OM.
   
   Opening this as a draft for suggestions. @GeorgeJahad @hemantk-12 could also 
take a look at this. I'd like to give some opinions and feedbacks on the 
overall approach.
   
   The core of this PR is `SnapshotCache` class and `RefCount` interface. One 
can start from there. `SnapshotCache` is supposed to be thread-safe.
   
   With this approach, every single time an `OmSnapshot` instance is retrieved 
using `snapshotCache.get()`, a corresponding `snapshotCache.release()` has to 
be called at the end of its usage, otherwise it causes leakage because the 
reference count is not decremented. I have identified all existing usages of 
`snapshotCache.get()`, wrapped in another helper method or not, on latest 
master branch and added the logic (or TODO) there on how they should be 
properly handled.
   
   One can very, very easily forget to release the handle (as we have seen with 
all the RDB handle leaks Duong has been painstakingly dealing with over the 
past few months), so I don't think this approach is the cleanest. However, 
putting the decrement ref count logic in `OmSnapshot#close` (so that we can 
wrap `OmSnapshot` inside a try-with-resources) would also break the 
encapsulation thus considered hacky by myself.
   
   I put sanity checks everywhere just to be defensive.
   
   In the meantime, I will explore if 
[`CacheBuilder#weakValues()`](https://www.javadoc.io/doc/com.google.guava/guava/latest/com/google/common/cache/CacheBuilder.html#weakValues())
 would be a viable alternative (potential approach 2), where we could probably 
rely on JVM itself to do the "reference counting" on the instances.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-7935
   
   ## How was this patch tested?
   
   - [ ] Unit test to be finished.
   - [ ] Tests that covers existing snapshot cache use-cases, that is all 
snapshot operations that read or write snapshot DB, should all pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to