Barry Oglesby created GEODE-6931:
------------------------------------
Summary: A failed RemotePutMessage can cause a
PersistentReplicatesOfflineException to be thrown when no persistent members
are offline
Key: GEODE-6931
URL: https://issues.apache.org/jira/browse/GEODE-6931
Project: Geode
Issue Type: Bug
Components: messaging
Reporter: Barry Oglesby
One of the places that RemotePutMessage is sent from is DistributedRegion virtualPut.
It's sent from this method in this case:
- 2 WAN sites
- the member in the receiving site that processes the batch defines the region
as replicate proxy
- other receiving site members define the region as replicate persistent
DistributedRegion virtualPut is invoked by the GatewayReceiverCommand here:
{noformat}
java.lang.Exception: Stack trace
    at java.lang.Thread.dumpStack(Thread.java:1333)
    at org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:341)
    at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:162)
    at org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5549)
    at org.apache.geode.internal.cache.LocalRegion.basicBridgePut(LocalRegion.java:5200)
    at org.apache.geode.internal.cache.tier.sockets.command.GatewayReceiverCommand.cmdExecute(GatewayReceiverCommand.java:429)
{noformat}
In this case, requiresOneHopForMissingEntry called by virtualPut returns true
since a proxy region with other persistent replicates can't generate a version
tag. This causes RemotePutMessage.distribute to be called.
If RemotePutMessage.distribute returns false (so didDistribute is false, meaning
the distribution failed), a PersistentReplicatesOfflineException is thrown
regardless of the actual exception on the remote member:
{noformat}
if (!generateVersionTag && !didDistribute) {
  throw new PersistentReplicatesOfflineException();
}
{noformat}
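The effect of the check quoted above can be modeled outside Geode. The following is a minimal, self-contained sketch and not Geode source: the class VirtualPutSketch, the RemoteFailure enum, the outcome helper, and the local exception class are all stand-ins invented for illustration, and the real virtualPut does considerably more. It only shows that every kind of remote failure collapses into the same exception.

```java
// Illustrative model only; none of these types are the real Geode classes.
class VirtualPutSketch {

  /** Stand-in for Geode's PersistentReplicatesOfflineException. */
  static class PersistentReplicatesOfflineException extends RuntimeException {}

  /** Hypothetical reasons the remote put can fail. */
  enum RemoteFailure { NONE, REPLICATE_OFFLINE, CONFLICTING_WAN_EVENT }

  /** Models RemotePutMessage.distribute: true only if the remote put succeeded. */
  static boolean distribute(RemoteFailure failure) {
    return failure == RemoteFailure.NONE;
  }

  /** Models only the check quoted above from DistributedRegion virtualPut. */
  static boolean virtualPut(boolean generateVersionTag, RemoteFailure failure) {
    boolean didDistribute = distribute(failure);
    if (!generateVersionTag && !didDistribute) {
      // Only a boolean is available here, so the real cause cannot be reported.
      throw new PersistentReplicatesOfflineException();
    }
    return true;
  }

  /** Returns the simple name of the exception thrown, or "ok" if the put succeeds. */
  static String outcome(RemoteFailure failure) {
    try {
      virtualPut(false, failure); // proxy region: cannot generate a version tag
      return "ok";
    } catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    for (RemoteFailure f : RemoteFailure.values()) {
      System.out.println(f + " -> " + outcome(f));
    }
  }
}
```

In this model, REPLICATE_OFFLINE and CONFLICTING_WAN_EVENT are indistinguishable to the caller, which is the behavior this issue describes.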
One of the ways that didDistribute can be false is if both the remote WAN site
and local WAN site are updating the same key at the same time. In that case a
ConcurrentCacheModificationException can occur in the replicate persistent
member (the one processing the RemotePutMessage).
This exception is not logged anywhere, and RemotePutMessage operateOnRegion
doesn't know anything about it.
RemotePutMessage operateOnRegion running in the replicate persistent member
calls:
{noformat}
result = r.getDataView().putEntry(event, this.ifNew, this.ifOld,
    this.expectedOldValue,
    this.requireOldValue, this.lastModified, true);
{noformat}
If putEntry returns false, operateOnRegion throws a RemoteOperationException
which is sent back to the caller and causes didDistribute to be false.
The result can be false in the RemotePutMessage operateOnRegion method because
of a ConcurrentCacheModificationException:
{noformat}
org.apache.geode.internal.cache.versions.ConcurrentCacheModificationException: conflicting WAN event detected
    at org.apache.geode.internal.cache.entries.AbstractRegionEntry.processGatewayTag(AbstractRegionEntry.java:1924)
    at org.apache.geode.internal.cache.entries.AbstractRegionEntry.processVersionTag(AbstractRegionEntry.java:1443)
    at org.apache.geode.internal.cache.entries.AbstractOplogDiskRegionEntry.processVersionTag(AbstractOplogDiskRegionEntry.java:165)
    at org.apache.geode.internal.cache.entries.VersionedThinDiskLRURegionEntryHeapStringKey1.processVersionTag(VersionedThinDiskLRURegionEntryHeapStringKey1.java:378)
    at org.apache.geode.internal.cache.AbstractRegionMap.processVersionTag(AbstractRegionMap.java:527)
    at org.apache.geode.internal.cache.map.RegionMapPut.updateEntry(RegionMapPut.java:484)
    at org.apache.geode.internal.cache.map.RegionMapPut.createOrUpdateEntry(RegionMapPut.java:256)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutAndDeliverEvent(AbstractRegionMapPut.java:300)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWithIndexUpdatingInProgress(AbstractRegionMapPut.java:308)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutIfPreconditionsSatisified(AbstractRegionMapPut.java:296)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnSynchronizedRegionEntry(AbstractRegionMapPut.java:282)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutOnRegionEntryInMap(AbstractRegionMapPut.java:273)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.addRegionEntryToMapAndDoPut(AbstractRegionMapPut.java:251)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPutRetryingIfNeeded(AbstractRegionMapPut.java:216)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doWithIndexInUpdateMode(AbstractRegionMapPut.java:198)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.doPut(AbstractRegionMapPut.java:180)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.runWhileLockedForCacheModification(AbstractRegionMapPut.java:119)
    at org.apache.geode.internal.cache.map.RegionMapPut.runWhileLockedForCacheModification(RegionMapPut.java:161)
    at org.apache.geode.internal.cache.map.AbstractRegionMapPut.put(AbstractRegionMapPut.java:169)
    at org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2047)
    at org.apache.geode.internal.cache.LocalRegion.virtualPut(LocalRegion.java:5569)
    at org.apache.geode.internal.cache.DistributedRegion.virtualPut(DistributedRegion.java:386)
    at org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:162)
    at org.apache.geode.internal.cache.tx.RemotePutMessage.operateOnRegion(RemotePutMessage.java:635)
    at org.apache.geode.internal.cache.tx.RemoteOperationMessage.process(RemoteOperationMessage.java:195)
{noformat}
This exception is caught in LocalRegion.virtualPut but not logged, so there is
no evidence of it. LocalRegion.virtualPut just returns false in that case.
So, to the caller, it looks like a persistent replicate is offline when it
isn't.
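How the cause gets lost on the remote member can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the Geode implementation: SwallowedConflictSketch, its methods' boolean parameter, the failureSeenByCaller helper, the "put failed" message text, and the local exception classes are all hypothetical stand-ins for the real types named above.

```java
// Illustrative model only; these are stand-ins, not the real Geode classes.
class SwallowedConflictSketch {

  static class ConcurrentCacheModificationException extends RuntimeException {
    ConcurrentCacheModificationException(String msg) { super(msg); }
  }

  static class RemoteOperationException extends Exception {
    RemoteOperationException(String msg) { super(msg); }
  }

  /** Models the region map put: throws when a conflicting WAN event is detected. */
  static void basicPut(boolean conflictingWanEvent) {
    if (conflictingWanEvent) {
      throw new ConcurrentCacheModificationException("conflicting WAN event detected");
    }
  }

  /** Models LocalRegion.virtualPut: the exception is caught, not logged, and becomes false. */
  static boolean virtualPut(boolean conflictingWanEvent) {
    try {
      basicPut(conflictingWanEvent);
      return true;
    } catch (ConcurrentCacheModificationException e) {
      return false; // the cause is dropped here
    }
  }

  /** Models RemotePutMessage operateOnRegion: a false result becomes a generic exception. */
  static void operateOnRegion(boolean conflictingWanEvent) throws RemoteOperationException {
    boolean result = virtualPut(conflictingWanEvent);
    if (!result) {
      throw new RemoteOperationException("put failed"); // hypothetical message text
    }
  }

  /** Returns the exception the sender would receive, or null if the put succeeded. */
  static RemoteOperationException failureSeenByCaller(boolean conflictingWanEvent) {
    try {
      operateOnRegion(conflictingWanEvent);
      return null;
    } catch (RemoteOperationException e) {
      return e;
    }
  }

  public static void main(String[] args) {
    RemoteOperationException e = failureSeenByCaller(true);
    System.out.println("message: " + e.getMessage() + ", cause: " + e.getCause());
  }
}
```

In the model, the exception that reaches the sender has no cause and no mention of the conflict, so the sender has nothing to base a more accurate exception on.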
A GatewayConflictResolver can help detect this case. If the resolver accepts
the WAN event, then the exceptions do not occur. If the resolver rejects the
WAN event, then the exceptions will occur.
All the exceptions really mean is that the WAN event was rejected because it
conflicted with a local event on the same key.
It would be nice if instead of RemotePutMessage operateOnRegion returning a
generic RemoteOperationException, an actual
ConcurrentCacheModificationException could be returned (or at least a
RemoteOperationException with the ConcurrentCacheModificationException
message). Short of that, logging the ConcurrentCacheModificationException and
throwing something other than the PersistentReplicatesOfflineException in
DistributedRegion virtualPut would be better.
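One possible shape for this suggestion is sketched below. It is not a patch against Geode: PreserveCauseSketch, the RemoteFailure enum, sendRemotePut, the outcome helper, the message strings, and the local exception classes are hypothetical stand-ins, and the sketch ignores how exceptions are serialized between members. It assumes the remote side attaches the original exception as the cause, so the sender can tell a conflict apart from a genuinely offline replicate. Whether the sender should rethrow the conflict or handle it some other way is a design decision the sketch does not settle.

```java
// Illustrative model only; these are stand-ins, not the real Geode classes.
class PreserveCauseSketch {

  static class ConcurrentCacheModificationException extends RuntimeException {
    ConcurrentCacheModificationException(String msg) { super(msg); }
  }

  static class PersistentReplicatesOfflineException extends RuntimeException {}

  static class RemoteOperationException extends Exception {
    RemoteOperationException(String msg, Throwable cause) { super(msg, cause); }
  }

  /** Hypothetical reasons the remote put can fail. */
  enum RemoteFailure { NONE, REPLICATE_OFFLINE, CONFLICTING_WAN_EVENT }

  /** Models sending the remote put; a conflict is reported with its cause attached. */
  static void sendRemotePut(RemoteFailure failure) throws RemoteOperationException {
    if (failure == RemoteFailure.REPLICATE_OFFLINE) {
      throw new RemoteOperationException("no persistent replicate reachable", null);
    }
    if (failure == RemoteFailure.CONFLICTING_WAN_EVENT) {
      ConcurrentCacheModificationException cause =
          new ConcurrentCacheModificationException("conflicting WAN event detected");
      throw new RemoteOperationException(cause.getMessage(), cause);
    }
  }

  /** Models the sender: choose the exception based on the reported cause. */
  static void virtualPut(RemoteFailure failure) {
    try {
      sendRemotePut(failure);
    } catch (RemoteOperationException e) {
      if (e.getCause() instanceof ConcurrentCacheModificationException) {
        throw (ConcurrentCacheModificationException) e.getCause();
      }
      throw new PersistentReplicatesOfflineException();
    }
  }

  /** Returns the simple name of the exception thrown, or "ok" if the put succeeds. */
  static String outcome(RemoteFailure failure) {
    try {
      virtualPut(failure);
      return "ok";
    } catch (RuntimeException e) {
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    for (RemoteFailure f : RemoteFailure.values()) {
      System.out.println(f + " -> " + outcome(f));
    }
  }
}
```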