[jira] [Commented] (GEODE-6975) When a redundant copy or replica of a distributed region failed to persistent remote member's new persistence id, it should send reply exception back to indicate what happened

ASF subversion and git services (JIRA) Wed, 24 Jul 2019 12:54:23 -0700


    [ 
https://issues.apache.org/jira/browse/GEODE-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892134#comment-16892134
 ]


ASF subversion and git services commented on GEODE-6975:
--------------------------------------------------------

Commit b29546c772c28771a6cab4c6c20dad6a52a6be2e in geode's branch 
refs/heads/feature/GEODE-6975 from eshu
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=b29546c ]

GEODE-6975: Do not send reply back if hit CancelException

 * Do not send a reply if a CancelException is hit while processing
   PrepareNewPersistentMemberMessage. Depend on membership to
   determine which member has the last copy of the region: the
   sender either waits for the replies or its membership listener
   detects that the receiver of the message has departed. In the
   latter case, the sender becomes the host of the last copy and
   should recover first to avoid the
   ConflictingPersistentDataException.
 * Use the cache's CancelCriterion instead of the
   DistributionManager's CancelCriterion in the sender's reply
   processor. This avoids an issue seen when shutdown occurs
   during bucket region creation: shutdown waits until all
   buckets are ready, while the sender of the
   PrepareNewPersistentMemberMessage is still waiting for a reply
   that it will never get if the receiver is also being shut
   down, and the DistributionManager has not yet been shut down
   at that point. Using the cache's CancelCriterion makes the
   sender's reply processor throw a CancelException instead of
   continuing to wait for the reply.
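
The two changes above can be illustrated with a minimal, self-contained
sketch. It does not use Geode's classes: ReplyWaitSketch,
CancelledSketchException, waitForReply and processPrepare are hypothetical
names invented for this illustration, and the polling loop is an assumed
simplification of how a reply processor waits.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class ReplyWaitSketch {
    /** Hypothetical stand-in for a cancellation exception; not Geode's CancelException. */
    static class CancelledSketchException extends RuntimeException {
        CancelledSketchException(String message) {
            super(message);
        }
    }

    enum Outcome { REPLY_RECEIVED, RECEIVER_DEPARTED }

    /**
     * Sender side: wait until the reply arrives or the receiver is reported
     * as departed. The cancel check is polled on every pass, so a closing
     * cache aborts the wait instead of blocking shutdown.
     */
    static Outcome waitForReply(CountDownLatch reply,
                                BooleanSupplier receiverDeparted,
                                BooleanSupplier cancelInProgress) {
        while (true) {
            if (cancelInProgress.getAsBoolean()) {
                throw new CancelledSketchException("cache is closing");
            }
            try {
                if (reply.await(10, TimeUnit.MILLISECONDS)) {
                    return Outcome.REPLY_RECEIVED;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new CancelledSketchException("interrupted while waiting");
            }
            if (receiverDeparted.getAsBoolean()) {
                return Outcome.RECEIVER_DEPARTED;
            }
        }
    }

    /**
     * Receiver side: reply only after the new id was actually persisted.
     * If persisting is cancelled, no reply is sent at all, and the sender
     * has to learn about the departure through membership.
     * Returns true if a reply was sent.
     */
    static boolean processPrepare(Runnable persistNewId, CountDownLatch reply) {
        try {
            persistNewId.run();
        } catch (CancelledSketchException e) {
            return false;
        }
        reply.countDown();
        return true;
    }

    public static void main(String[] args) {
        // Case 1: the receiver persists the id and replies.
        CountDownLatch reply1 = new CountDownLatch(1);
        processPrepare(() -> { }, reply1);
        System.out.println(waitForReply(reply1, () -> false, () -> false));

        // Case 2: the receiver is cancelled, sends nothing, and then departs.
        CountDownLatch reply2 = new CountDownLatch(1);
        processPrepare(() -> {
            throw new CancelledSketchException("receiver closing");
        }, reply2);
        System.out.println(waitForReply(reply2, () -> true, () -> false));

        // Case 3: the sender's own cache is closing, so the wait is abandoned.
        try {
            waitForReply(new CountDownLatch(1), () -> false, () -> true);
        } catch (CancelledSketchException e) {
            System.out.println("cancelled: " + e.getMessage());
        }
    }
}
```

In this sketch the sender never treats a missing reply as success: it
either sees a real reply, sees the receiver depart, or is cancelled
itself.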


> When a redundant copy or replica of a distributed region failed to persistent 
> remote member's new persistence id, it should send reply exception back to 
> indicate what happened
> -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-6975
>                 URL: https://issues.apache.org/jira/browse/GEODE-6975
>             Project: Geode
>          Issue Type: Bug
>          Components: persistence, regions
>    Affects Versions: 1.1.0
>            Reporter: Eric Shu
>            Assignee: Eric Shu
>            Priority: Major
>              Labels: GeodeCommons
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, when a persistent bucket or distributed region is created on 
> member A, member A sends its new PersistentMemberID to the other hosts (e.g. 
> member B), so that member B knows and persists A's new ID for the region. 
> However, when member B is being shut down while processing the 
> PrepareNewPersistentMemberMessage (and so has not persisted A's ID), it still 
> sends a reply message indicating that it has persisted the ID. This causes 
> member A to remove its old member ID and persist only its new member ID. This 
> is wrong, as member A could also be shut down at the same time. There is a 
> race in which member B could be recognized as hosting the last copy of the 
> region. Member B then recovers first, and it can only recover member A's old 
> persistent ID. As a result, member A is not able to restart, because B does 
> not recognize A's new persistent ID.
> [error 2018/09/19 01:18:00.972 PDT dataStoregemfire6_host1_6131 <Recovery 
> thread for bucket _B__partitionedRegion_0> tid=0x77] A DiskAccessException 
> has occurred while writing to the disk for region 
> /__PR/_B__partitionedRegion_0. The cache will be closed.
> org.apache.geode.cache.persistence.ConflictingPersistentDataException: Region 
> /__PR/_B__partitionedRegion_0 remote member 
> rs-FullRegression19041704a3i3large-hydra-client-62(dataStoregemfire1_host1_5862:5862)<ec><v8>:1025
>  with persistent data 
> /10.32.109.230:/var/vcap/data/rundir/concParRegHAPersistPdxVA57H/concParRegHAPersistPdx-0919-011540/vm_1_dataStore1_disk_1
>  created at timestamp 1537345060760 version 0 diskStoreId 
> a35a937a082b4066-af019365b6a5114b name null was not part of the same 
> distributed system as the local data from 
> /10.32.109.230:/var/vcap/data/rundir/concParRegHAPersistPdxVA57H/concParRegHAPersistPdx-0919-011540/vm_6_dataStore6_disk_1
>  created at timestamp 1537344996470 version 0 diskStoreId 
> 108be5a03966418f-980c1d88e9b26d1d name null
>         at 
> org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.checkMyStateOnMembers(PersistenceAdvisorImpl.java:521)
>         at 
> org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.removeReplicatesIfWeAreEqualToAnyOrElseClearEqualMembers(PersistenceInitialImageAdvisor.java:181)
>         at 
> org.apache.geode.internal.cache.persistence.PersistenceInitialImageAdvisor.getAdvice(PersistenceInitialImageAdvisor.java:69)
>         at 
> org.apache.geode.internal.cache.persistence.PersistenceAdvisorImpl.getInitialImageAdvice(PersistenceAdvisorImpl.java:831)
>         at 
> org.apache.geode.internal.cache.persistence.CreatePersistentRegionProcessor.getInitialImageAdvice(CreatePersistentRegionProcessor.java:52)
>         at 
> org.apache.geode.internal.cache.DistributedRegion.getInitialImageAndRecovery(DistributedRegion.java:1200)
>         at 
> org.apache.geode.internal.cache.DistributedRegion.initialize(DistributedRegion.java:1081)
>         at 
> org.apache.geode.internal.cache.BucketRegion.initialize(BucketRegion.java:258)
>         at 
> org.apache.geode.internal.cache.LocalRegion.createSubregion(LocalRegion.java:1014)
>         at 
> org.apache.geode.internal.cache.PartitionedRegionDataStore.createBucketRegion(PartitionedRegionDataStore.java:779)
>         at 
> org.apache.geode.internal.cache.PartitionedRegionDataStore.grabFreeBucket(PartitionedRegionDataStore.java:454)
>         at 
> org.apache.geode.internal.cache.PartitionedRegionDataStore.grabBucket(PartitionedRegionDataStore.java:2895)
>         at 
> org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDisk(ProxyBucketRegion.java:447)
>         at 
> org.apache.geode.internal.cache.ProxyBucketRegion.recoverFromDiskRecursively(ProxyBucketRegion.java:390)
>         at 
> org.apache.geode.internal.cache.PRHARedundancyProvider$4.run2(PRHARedundancyProvider.java:1756)
>         at 
> org.apache.geode.internal.cache.partitioned.RecoveryRunnable.run(RecoveryRunnable.java:58)
>         at 
> org.apache.geode.internal.cache.PRHARedundancyProvider$4.run(PRHARedundancyProvider.java:1748)
>         at java.lang.Thread.run(Thread.java:748)
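
The race described in the quoted issue can be modelled with a toy sketch.
This is an assumed simplification, not Geode's persistence logic: the class
PersistentIdRaceSketch, the method canARestart and the string ids are
hypothetical, and the model assumes that member A keeps both its old and
new id until it receives a success reply.

```java
import java.util.HashSet;
import java.util.Set;

class PersistentIdRaceSketch {
    /**
     * Returns true if member A can rejoin after both members were shut down
     * and member B recovered first.
     *
     * @param bPersistedNewId whether B actually persisted A's new id
     * @param bRepliedSuccess whether B sent a success reply to A
     */
    static boolean canARestart(boolean bPersistedNewId, boolean bRepliedSuccess) {
        String oldId = "A-old";
        String newId = "A-new";

        // What B has on disk about A; it always knows the old id.
        Set<String> idsOfAKnownToB = new HashSet<>();
        idsOfAKnownToB.add(oldId);
        if (bPersistedNewId) {
            idsOfAKnownToB.add(newId);
        }

        // What A has on disk about itself (assumption: both ids until a reply).
        Set<String> idsPersistedByA = new HashSet<>();
        idsPersistedByA.add(oldId);
        idsPersistedByA.add(newId);
        if (bRepliedSuccess) {
            // On a success reply, A drops its old id and keeps only the new one.
            idsPersistedByA.remove(oldId);
        }

        // B recovered first; A can rejoin only if B recognizes one of A's ids.
        for (String id : idsPersistedByA) {
            if (idsOfAKnownToB.contains(id)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Reported bug: B replied without persisting, so A cannot restart.
        System.out.println(canARestart(false, true));  // prints "false"
        // With the fix: B sends no reply, A keeps its old id and can restart.
        System.out.println(canARestart(false, false)); // prints "true"
        // Normal case: B persisted and replied.
        System.out.println(canARestart(true, true));   // prints "true"
    }
}
```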



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
