jvarenina commented on a change in pull request #7145:
URL: https://github.com/apache/geode/pull/7145#discussion_r764065465
##########
File path: geode-core/src/main/java/org/apache/geode/internal/cache/Oplog.java
##########
@@ -1303,16 +1305,18 @@ void close(DiskRegion dr) {
// while a krf is being created can not close a region
lockCompactor();
try {
- addUnrecoveredRegion(dr.getId());
- DiskRegionInfo dri = getDRI(dr);
- if (dri != null) {
- long clearCount = dri.clear(null);
- if (clearCount != 0) {
- totalLiveCount.addAndGet(-clearCount);
- // no need to call handleNoLiveValues because we now have an
- // unrecovered region.
+ if (!isDrfOnly()) {
Review comment:
We have created the following test cases (for more information please
check comments in test cases):
1. Recovery of single region after cache is restarted
(`testCompactorRegionMapDeletedForOnlyDrfOplogAfterCompactionAndRecoveryAfterCacheClosed`)
2. Recovery of single region after region is closed and then recreated
(`testCompactorRegionMapDeletedForOnlyDrfOplogAfterCompactionAndRecoveryAfterRegionClose`)
In all cases all data have been successfully recovered from Oplog files, and
loaded into the cache memory. It seems that this counter in only used to skip
compaction of Oplog file (.crf file) when all region that use Oplog (.crf file)
didn't yet recover (e.g. region were closed). All orphaned/only .drf files that
reference older .crf Opolg cannot be deleted until older .crf Oplog are
deleted. So having `unrecoveredRegionCount` on .crf also prevents deleting any
newer .drf file that could contain destroy references to it's entries. We could
not find the case were not marking .drf only Oplog as unrecovered during close
could cause some issues. Could you please share exact steps for the case that
you think it will cause problem and I will try to create it?
During these tests we have found another problem in design base (not
introduced by this PR) when region is closed and then recreated to start the
recovery. If you inspect this code in close() function you will notice that it
doesn't make any sense:
```
void close(DiskRegion dr) {
// while a krf is being created can not close a region
lockCompactor();
try {
if (!isDrfOnly()) {
addUnrecoveredRegion(dr.getId());
DiskRegionInfo dri = getDRI(dr);
if (dri != null) {
long clearCount = dri.clear(null);
if (clearCount != 0) {
totalLiveCount.addAndGet(-clearCount);
// no need to call handleNoLiveValues because we now have an
// unrecovered region.
}
regionMap.get().remove(dr.getId(), dri);
}
}
} finally {
unlockCompactor();
}
}
private void addUnrecoveredRegion(long drId) {
DiskRegionInfo dri = getOrCreateDRI(drId);
if (dri.testAndSetUnrecovered()) {
unrecoveredRegionCount.incrementAndGet();
}
}
```
Please notice that `addUnrecoveredRegion()` marks `DiskRegionInfo` object as
`unrecovered` and increments counter `unrecoveredRegionCount`. This
`DiskRegionInfo` object is contained in `regionMap` structure. Then afterwards
it removes `DiskRegionInfo` object (that was previously marked as
`unrecovered`) from the `regionMap`. This doesn't make any sense, it updated
object and then removed it from map to be garbage collected. As you will see
later on this will cause some issues when region is recovered.
Please check this code at recovery:
```
/**
* For each dri that this oplog has that is currently unrecoverable check to
see if a DiskRegion
* that is recoverable now exists.
*/
void checkForRecoverableRegion(DiskRegionView dr) {
if (unrecoveredRegionCount.get() > 0) {
DiskRegionInfo dri = getDRI(dr);
if (dri != null) {
if (dri.testAndSetRecovered(dr)) {
unrecoveredRegionCount.decrementAndGet();
}
}
}
}
```
The problem is that geode will not clear counter `unrecoveredRegionCount` in
Oplog objects after recovery is done. This is because
`checkForRecoverableRegion` will check `unrecoveredRegionCount` counter and
perform `testAndSetRecovered`. The `testAndSetRecovered` will always return
false, because non of the `DiskRegionInfo` objects in region map have
`unrecovered` flag set to **true** (all object marked as `unrecovered` were
deleted by close(), and then they were recreated during recovery.... see
**note** below). The problem here is that all Oplogs will be fully recovered
with the counter incorrectly indicating `unrecoveredRegionCount>0`. This will
later on prevent the compaction of recovered Oplogs (the files that have .crf,
.drf and .krf) when they reach compaction threshold.
**Note:** During recovery `regionMap` will be recreated from the Oplog
files. Since all `DiskRegionInfo` objects are deleted from `regionMap` during
the `close()`, they will be recreated by using function `initRecoveredEntry`
during the recovery. All `DiskRegionInfo` will be created with flag
`unrecovered` set to false.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]