[
https://issues.apache.org/jira/browse/GEODE-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892134#comment-16892134
]
ASF subversion and git services commented on GEODE-6975:
--------------------------------------------------------
Commit b29546c772c28771a6cab4c6c20dad6a52a6be2e in geode's branch
refs/heads/feature/GEODE-6975 from eshu
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=b29546c ]
GEODE-6975: Do not send reply back if hit CancelException
* Do not send a reply back if a CancelException is hit while
processing PrepareNewPersistentMemberMessage. Instead, depend on
membership to determine which member has the last copy of the
region: the sender either waits for the replies or its membership
listener detects that the receiver of the message has departed.
In the latter case, the sender should become the host of the
last copy and should recover first to avoid the
ConflictingPersistentDataException.
* Use the cache's CancelCriterion instead of the DistributionManager's
CancelCriterion in the sender's reply processor. This avoids an
issue seen when shutdown occurs during bucket region creation:
shutdown waits until all buckets are ready, while the sender of the
PrepareNewPersistentMemberMessage is still waiting for a reply that
it will not get if the receiver is also being shut down, and the
DistributionManager has not yet been shut down at that time. Using
the cache's CancelCriterion makes the sender's reply processor
throw a CancelException instead of continuing to wait for the reply.
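
The two changes above can be sketched as a simplified model. This is not Geode code: PrepareReplySketch, CancelledException, Outcome, processOnReceiver and senderStep are hypothetical names used only for illustration, and CancelledException merely stands in for Geode's CancelException. The sketch assumes the receiver stays silent when cancelled, and the sender checks the cache-level cancel state before looking for a reply or a departure.

```java
import java.util.Optional;

class PrepareReplySketch {
  /** Hypothetical stand-in for Geode's CancelException. */
  static class CancelledException extends RuntimeException {}

  enum Outcome { REPLY_RECEIVED, RECEIVER_DEPARTED, KEEP_WAITING }

  /**
   * Receiver side: reply only if the sender's new id was actually persisted.
   * If persisting is cancelled by shutdown, send nothing back.
   */
  static Optional<String> processOnReceiver(Runnable persistNewId) {
    try {
      persistNewId.run();
      return Optional.of("persisted");
    } catch (CancelledException e) {
      // Shutting down and nothing persisted: no reply, so the sender
      // cannot mistake this for a successful persist.
      return Optional.empty();
    }
  }

  /**
   * Sender side: one evaluation of the reply processor's wait condition.
   * cacheCancelInProgress models the cache's cancel state (assumption:
   * it becomes true before the distribution layer is shut down).
   */
  static Outcome senderStep(boolean cacheCancelInProgress,
                            Optional<String> reply,
                            boolean receiverDeparted) {
    if (cacheCancelInProgress) {
      // Stop waiting as soon as the local cache is closing.
      throw new CancelledException();
    }
    if (reply.isPresent()) {
      return Outcome.REPLY_RECEIVED;
    }
    if (receiverDeparted) {
      // Sender now holds the last copy and should recover first.
      return Outcome.RECEIVER_DEPARTED;
    }
    return Outcome.KEEP_WAITING;
  }
}
```

In this model a cancelled receiver produces an empty reply, so the sender leaves its wait only through a real reply, a membership departure, or its own cache being cancelled.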
> When a redundant copy or replica of a distributed region fails to persist a
> remote member's new persistence id, it should send a reply exception back to
> indicate what happened
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: GEODE-6975
> URL: https://issues.apache.org/jira/browse/GEODE-6975
> Project: Geode
> Issue Type: Bug
> Components: persistence, regions
> Affects Versions: 1.1.0
> Reporter: Eric Shu
> Assignee: Eric Shu
> Priority: Major
> Labels: GeodeCommons
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Currently, when a persistent bucket or distributed region is created on
> member A, member A sends its new PersistentMemberID to other hosts (e.g.
> member B), so that member B will know and persist A's new ID for the region.
> However, when member B is being shut down while processing the
> PrepareNewPersistentMemberMessage (and so did not persist A's ID), it still
> sends a reply message indicating that it had persisted the ID. This causes
> member A to remove its old member ID and persist only its new member ID.
> This is wrong, as member A could also be shut down at the same time. There
> is a race in which member B could be recognized as hosting the last copy of
> the region. This leads member B to recover first, and member B can only
> recover member A's old persistent ID. As a result, member A is not able to
> restart, as B does not recognize A's new persistent ID.
> [error 2018/09/19 01:18:00.972 PDT dataStoregemfire6_host1_6131 <Recovery
> thread for bucket _B__partitionedRegion_0> tid=0x77] A DiskAccessException
> has occurred while writing to the disk for region
> /__PR/_B__partitionedRegion_0. The cache will be closed.
> org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region
> /__PR/_B__partitionedRegion_0 remote member
> rs-FullRegression19041704a3i3large-hydra-client-62(dataStoregemfire1_host1_5862:5862)<ec><v8>:1025
> with persistent data
> /10.32.109.230:/var/vcap/data/rundir/concParRegHAPersistPdxVA57H/concParRegHAPersistPdx-0919-011540/vm_1_dataStore1_disk_1
> created at timestamp 1537345060760 version 0 diskStoreId
> a35a937a082b4066-af019365b6a5114b name null was not part of the same
> distributed system as the local data from
> /10.32.109.230:/var/vcap/data/rundir/concParRegHAPersistPdxVA57H/concParRegHAPersistPdx-0919-011540/vm_6_dataStore6_disk_1
> created at timestamp 1537344996470 version 0 diskStoreId
> 108be5a03966418f-980c1d88e9b26d1d name null
> at
> org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:521)
> at
> org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.removeReplicatesIfWeAreEqualToAnyOrElseClearEqualMembers(PersistenceInitialImageAdvisor.java:181)
> at
> org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.getAdvice(PersistenceInitialImageAdvisor.java:69)
> at
> org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:831)
> at
> org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
> at
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1200)
> at
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1081)
> at
> org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:258)
> at
> org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:1014)
> at
> org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:779)
> at
> org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:454)
> at
> org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2895)
> at
> org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:447)
> at
> org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:390)
> at
> org.apache.geode.internal.cache.PRHARedundancyProvider$4.run2(PRHARedundancyProvider.java:1756)
> at
> org.apache.geode.internal.cache.partitioned.RecoveryRunnable.run(RecoveryRunnable.java:58)
> at
> org.apache.geode.internal.cache.PRHARedundancyProvider$4.run(PRHARedundancyProvider.java:1748)
> at java.lang.Thread.run(Thread.java:748)
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)