[jira] [Updated] (GEODE-9881) Fully recoverd Oplogs object indicating unrecoveredRegionCount>0 preventing compaction

Jakov Varenina (Jira) Thu, 09 Dec 2021 06:11:27 -0800


     [ 
https://issues.apache.org/jira/browse/GEODE-9881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jakov Varenina updated GEODE-9881:
----------------------------------
    Description: 
We have found a problem that occurs when a region is closed with Region.close() 
and then recreated to start recovery. The close() function below handles the 
DiskRegionInfo object inconsistently:
{code:java}
  void close(DiskRegion dr) {
    // while a krf is being created can not close a region
    lockCompactor();
    try {
      if (!isDrfOnly()) {
        DiskRegionInfo dri = getDRI(dr);
        if (dri != null) {
          long clearCount = dri.clear(null);
          if (clearCount != 0) {
            totalLiveCount.addAndGet(-clearCount);
            // no need to call handleNoLiveValues because we now have an
            // unrecovered region.
          }
          regionMap.get().remove(dr.getId(), dri);
        }
        addUnrecoveredRegion(dr.getId());
      }
    } finally {
      unlockCompactor();
    }
  }

{code}
Note that addUnrecoveredRegion() marks the DiskRegionInfo object as unrecovered 
and increments the unrecoveredRegionCount counter. This DiskRegionInfo object 
is held in the regionMap structure. However, close() also removes the 
DiskRegionInfo object that was marked as unrecovered from regionMap. This is 
inconsistent: the object is updated and then dropped from the map, where it 
becomes eligible for garbage collection, while the counter keeps the increment. 
As shown below, this causes problems when the region is recovered.

Now consider this code, which runs at recovery:
{code:java}
/**
 * For each dri that this oplog has that is currently unrecoverable check to
 * see if a DiskRegion that is recoverable now exists.
 */
void checkForRecoverableRegion(DiskRegionView dr) {
  if (unrecoveredRegionCount.get() > 0) {
    DiskRegionInfo dri = getDRI(dr);
    if (dri != null) {
      if (dri.testAndSetRecovered(dr)) {
        unrecoveredRegionCount.decrementAndGet();
      }
    }
  }
}
{code}
The problem is that Geode does not reset the unrecoveredRegionCount counter in 
Oplog objects after recovery is done. checkForRecoverableRegion() checks the 
counter and then calls testAndSetRecovered(), which always returns false 
because none of the DiskRegionInfo objects in regionMap has its unrecovered 
flag set to true: every object marked as unrecovered was removed by close() and 
then recreated during recovery (see the note below). As a result, all Oplogs 
are fully recovered while the counter still incorrectly indicates 
unrecoveredRegionCount>0. This later prevents compaction of the recovered 
Oplogs (the files that have .crf, .drf and .krf) when they reach the compaction 
threshold.

Note: During recovery, regionMap is rebuilt from the Oplog files. Since all 
DiskRegionInfo objects are removed from regionMap during close(), they are 
recreated by the initRecoveredEntry function during recovery. Every 
DiskRegionInfo object is created with its unrecovered flag set to false.
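
The sequence described above can be reproduced with a small stand-alone model. 
This is only a sketch and not Geode code: OplogCounterSketch, RegionInfo, 
addRegion() and eligibleForCompaction() are hypothetical names, the close() 
body is a simplification that assumes the flagged object does not stay in the 
map, and the compaction check assumes only what is stated above (a counter 
greater than zero blocks compaction).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of the counter drift described in this issue; not Geode code. */
public class OplogCounterSketch {

  /** Hypothetical stand-in for DiskRegionInfo: it only carries the flag. */
  static final class RegionInfo {
    private boolean unrecovered;

    /** Returns true only when the flag changes from false to true. */
    boolean testAndSetUnrecovered() {
      if (unrecovered) {
        return false;
      }
      unrecovered = true;
      return true;
    }

    /** Returns true only when the flag changes from true to false. */
    boolean testAndSetRecovered() {
      if (!unrecovered) {
        return false;
      }
      unrecovered = false;
      return true;
    }
  }

  private final Map<Long, RegionInfo> regionMap = new ConcurrentHashMap<>();
  private final AtomicInteger unrecoveredRegionCount = new AtomicInteger();

  /** Hypothetical helper: registers a region that has live entries. */
  public void addRegion(long regionId) {
    regionMap.put(regionId, new RegionInfo());
  }

  /** Simplified close(): the counter grows, the flagged object leaves the map. */
  public void close(long regionId) {
    RegionInfo info = regionMap.remove(regionId);
    if (info != null && info.testAndSetUnrecovered()) {
      unrecoveredRegionCount.incrementAndGet();
    }
  }

  /** Models initRecoveredEntry: a fresh object whose flag is false. */
  public void initRecoveredEntry(long regionId) {
    regionMap.computeIfAbsent(regionId, id -> new RegionInfo());
  }

  /** Same shape as checkForRecoverableRegion() quoted above. */
  public void checkForRecoverableRegion(long regionId) {
    if (unrecoveredRegionCount.get() > 0) {
      RegionInfo info = regionMap.get(regionId);
      if (info != null && info.testAndSetRecovered()) {
        unrecoveredRegionCount.decrementAndGet();
      }
    }
  }

  public int unrecoveredRegionCount() {
    return unrecoveredRegionCount.get();
  }

  /** Assumption: a non-zero counter is what blocks compaction. */
  public boolean eligibleForCompaction() {
    return unrecoveredRegionCount.get() == 0;
  }

  public static void main(String[] args) {
    OplogCounterSketch oplog = new OplogCounterSketch();
    oplog.addRegion(1L);
    oplog.close(1L);                     // counter becomes 1
    oplog.initRecoveredEntry(1L);        // fresh object, flag is false
    oplog.checkForRecoverableRegion(1L); // testAndSetRecovered() returns false
    System.out.println("unrecoveredRegionCount = " + oplog.unrecoveredRegionCount());
    System.out.println("eligibleForCompaction = " + oplog.eligibleForCompaction());
  }
}
```

In this model the counter stays at 1 after the region is fully recovered, so 
the modeled Oplog never becomes eligible for compaction.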

 

  was:
We have found problem in case when region is closed with 
{color:#ffffff}Region.close(){color} and then recreated to start the recovery. 
If you inspect this code in close() function you will notice that it doesn't 
make any sense:
{code:java}
  void close(DiskRegion dr) {
    // while a krf is being created can not close a region
    lockCompactor();
    try {
      if (!isDrfOnly()) {
        DiskRegionInfo dri = getDRI(dr);
        if (dri != null) {
          long clearCount = dri.clear(null);
          if (clearCount != 0) {
            totalLiveCount.addAndGet(-clearCount);
            // no need to call handleNoLiveValues because we now have an
            // unrecovered region.
          }
          regionMap.get().remove(dr.getId(), dri);
        }
        addUnrecoveredRegion(dr.getId());
      }
    } finally {
      unlockCompactor();
    }
  }

{code}
Please notice that addUnrecoveredRegion() marks DiskRegionInfo object as 
unrecovered and increments counter unrecoveredRegionCount. This DiskRegionInfo 
object is contained in regionMap structure. Then afterwards it removes 
DiskRegionInfo object (that was previously marked as unrecovered) from the 
regionMap. This doesn't make any sense, it updated object and then removed it 
from map to be garbage collected. As you will see later on this will cause some 
issues when region is recovered.

Please check this code at recovery:
{code:java}
/**
 * For each dri that this oplog has that is currently unrecoverable check to 
see if a DiskRegion
 * that is recoverable now exists.
 */
void checkForRecoverableRegion(DiskRegionView dr) {
  if (unrecoveredRegionCount.get() > 0) {
    DiskRegionInfo dri = getDRI(dr);
    if (dri != null) {
      if (dri.testAndSetRecovered(dr)) {
        unrecoveredRegionCount.decrementAndGet();
      }
    }
  }
}
{code}
The problem is that geode will not clear counter unrecoveredRegionCount in 
Oplog objects after recovery is done. This is because checkForRecoverableRegion 
will check unrecoveredRegionCount counter and perform testAndSetRecovered. The 
testAndSetRecovered will always return false, because non of the DiskRegionInfo 
objects in region map have unrecovered flag set to true (all object marked as 
unrecovered were deleted by close(), and then they were recreated during 
recovery.... see note below). The problem here is that all Oplogs will be fully 
recovered with the counter incorrectly indicating unrecoveredRegionCount>0. 
This will later on prevent the compaction of recovered Oplogs (the files that 
have .crf, .drf and .krf) when they reach compaction threshold.

Note: During recovery regionMap will be recreated from the Oplog files. Since 
all DiskRegionInfo objects are deleted from regionMap during the close(), they 
will be recreated by using function initRecoveredEntry during the recovery. All 
DiskRegionInfo will be created with flag unrecovered set to false.

 


> Fully recoverd Oplogs object indicating unrecoveredRegionCount>0 preventing 
> compaction
> --------------------------------------------------------------------------------------
>
>                 Key: GEODE-9881
>                 URL: https://issues.apache.org/jira/browse/GEODE-9881
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
