[GitHub] [geode] jvarenina commented on a change in pull request #7145: GEODE-9854: Orphaned .drf file causing memory leak

GitBox Tue, 07 Dec 2021 06:46:44 -0800


jvarenina commented on a change in pull request #7145:
URL: https://github.com/apache/geode/pull/7145#discussion_r764065465




##########
File path: geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
##########
@@ -1303,16 +1305,18 @@ void close(DiskRegion dr) {
     // while a krf is being created can not close a region
     lockCompactor();
     try {
-      addUnrecoveredRegion(dr.getId());
-      DiskRegionInfo dri = getDRI(dr);
-      if (dri != null) {
-        long clearCount = dri.clear(null);
-        if (clearCount != 0) {
-          totalLiveCount.addAndGet(-clearCount);
-          // no need to call handleNoLiveValues because we now have an
-          // unrecovered region.
+      if (!isDrfOnly()) {

Review comment:
       We have created the following test cases (for more information please 
check comments in test cases):
   
   1. Recovery of single region after cache is restarted 
(`testCompactorRegionMapDeletedForOnlyDrfOplogAfterCompactionAndRecoveryAfterCacheClosed`)
   
   2. Recovery of single region after region is closed and then recreated 
(`testCompactorRegionMapDeletedForOnlyDrfOplogAfterCompactionAndRecoveryAfterRegionClose`)
   
   In all cases all data have been successfully recovered from Oplog files, and 
loaded into the cache memory. It seems that this counter in only used to skip 
compaction of Oplog file (.crf file) when all region that use Oplog (.crf file) 
didn't yet recover (e.g. region were closed). All orphaned/only .drf files that 
reference older .crf Opolg cannot be deleted until older .crf Oplog are 
deleted. So having `unrecoveredRegionCount` on .crf also prevents deleting any 
newer .drf file that could contain destroy references to it's entries. We could 
not find the case were not marking .drf only Oplog as unrecovered during close 
could cause some issues. Could you please share exact steps for the case that 
you think it will cause problem and I will try to create it?
   
   During these tests we have found another problem in design base (not 
introduced by this PR) when region is closed and then recreated to start the 
recovery. If you inspect this code in close() function you will notice that it 
doesn't make any sense:
   
   ```
   void close(DiskRegion dr) {
     // while a krf is being created can not close a region
     lockCompactor();
     try {
       if (!isDrfOnly()) {
         addUnrecoveredRegion(dr.getId());
         DiskRegionInfo dri = getDRI(dr);
         if (dri != null) {
           long clearCount = dri.clear(null);
           if (clearCount != 0) {
             totalLiveCount.addAndGet(-clearCount);
             // no need to call handleNoLiveValues because we now have an
             // unrecovered region.
           }
           regionMap.get().remove(dr.getId(), dri);
         }
       }
     } finally {
       unlockCompactor();
     }
   }
   private void addUnrecoveredRegion(long drId) {
     DiskRegionInfo dri = getOrCreateDRI(drId);
     if (dri.testAndSetUnrecovered()) {
       unrecoveredRegionCount.incrementAndGet();
     }
   }
   
   ```
   Please notice that `addUnrecoveredRegion()` marks `DiskRegionInfo` object as 
`unrecovered` and increments counter `unrecoveredRegionCount`. This 
`DiskRegionInfo` object is contained in `regionMap` structure. Then afterwards 
it removes `DiskRegionInfo` object (that was previously marked as 
`unrecovered`) from the `regionMap`. This doesn't make any sense, it updated 
object and then removed it from map to be garbage collected. As you will see 
later on this will cause some issues when region is recovered.
   
   Please check this code at recovery:
   ```
   
   /**
    * For each dri that this oplog has that is currently unrecoverable check to 
see if a DiskRegion
    * that is recoverable now exists.
    */
   void checkForRecoverableRegion(DiskRegionView dr) {
     if (unrecoveredRegionCount.get() > 0) {
       DiskRegionInfo dri = getDRI(dr);
       if (dri != null) {
         if (dri.testAndSetRecovered(dr)) {
           unrecoveredRegionCount.decrementAndGet();
         }
       }
     }
   }
   
   ```
   The problem is that geode will not clear counter `unrecoveredRegionCount` in 
Oplog objects after recovery is done. This is because 
`checkForRecoverableRegion` will check `unrecoveredRegionCount` counter and 
perform `testAndSetRecovered`. The `testAndSetRecovered` will always return 
false, because non of the `DiskRegionInfo` objects in region map have 
`unrecovered` flag set to **true** (all object marked as `unrecovered` were 
deleted by close(), and then they were recreated during recovery.... see 
**note** below). The problem here is that all Oplogs will be fully recovered 
with the counter incorrectly indicating `unrecoveredRegionCount>0`. This will 
later on prevent the compaction of recovered Oplogs (the files that have .crf, 
.drf and .krf) when they reach compaction threshold.
   
   **Note:** During recovery `regionMap` will be recreated from the Oplog 
files. Since all `DiskRegionInfo` objects are deleted from `regionMap` during 
the `close()`, they will be recreated by using function `initRecoveredEntry` 
during the recovery. All `DiskRegionInfo` will be created with flag 
`unrecovered` set to false.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [geode] jvarenina commented on a change in pull request #7145: GEODE-9854: Orphaned .drf file causing memory leak

Reply via email to