[ https://issues.apache.org/jira/browse/GEODE-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce J Schuchardt updated GEODE-8385:
--------------------------------------
    Fix Version/s:     (was: 1.14.0)

> hang recovering from disk with cyclic dependencies
> --------------------------------------------------
>
>                 Key: GEODE-8385
>                 URL: https://issues.apache.org/jira/browse/GEODE-8385
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, persistence
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: no-release-note, pull-request-available
>             Fix For: 1.13.0
>
>
> In a test cluster using replicated persistent Regions all of the servers were
> shut down and restarted. The restart hung showing a cycle in disk store
> dependencies.
> {noformat}
> [info 2020/05/29 03:02:36.635 PDT <Thread-18> tid=0x8f] Region /Region_14 has
> potentially stale data. It is waiting for another online member to recover
> the latest data.My persistent id:
> DiskStore ID: a175354a-d27d-4575-9916-16fd7ff7ea67 Name:
> persistgemfire4_host1_4194 Location:
> /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_5_persist4_disk_1
> Members with potentially new data:[
> DiskStore ID: 2d77752e-507d-4425-a382-a5856c61938f Name:
> persistgemfire10_host1_4208 Location:
> /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_2_persist10_disk_1]
> Use the gfsh show missing-disk-stores command to see all disk stores that are
> being waited on by other members.
> {noformat}
> After looking at the logs for all members, the "members with potentially new
> data" for each member were found to be:
> {noformat}
> Member | Members with potentially new data
> -------+----------------------------------
>      1 | all
>      2 | 4
>      3 | 4
>      4 | 10
>      5 | 2, 3, 4, 8, 10
>      6 | 2, 3, 4, 5, 7, 8, 10
>      7 | 3, 4, 10
>      8 | 3, 4, 10
>      9 | 2, 3, 4, 5, 7, 8, 10
>     10 | 3
> {noformat}
> It appears that there is a cycle in this "waiting for another online member"
> graph between 3 > 4 > 10 > 3.
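As an illustration only (this is not Geode code), the cycle can be confirmed mechanically by treating the table as a directed "waits for" graph and running a depth-first search. The sketch below assumes that "all" for member 1 means every other member (2 through 10); the names `WAITING_FOR` and `find_cycle` are hypothetical and exist only for this example.

```python
# "Waits for" graph transcribed from the table in this report.
# Assumption: "all" for member 1 means every other member (2..10).
WAITING_FOR = {
    1: [2, 3, 4, 5, 6, 7, 8, 9, 10],
    2: [4],
    3: [4],
    4: [10],
    5: [2, 3, 4, 8, 10],
    6: [2, 3, 4, 5, 7, 8, 10],
    7: [3, 4, 10],
    8: [3, 4, 10],
    9: [2, 3, 4, 5, 7, 8, 10],
    10: [3],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge to a node on the current path: a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in graph:
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


if __name__ == "__main__":
    cycle = find_cycle(WAITING_FOR)
    print(" > ".join(map(str, cycle)))  # 4 > 10 > 3 > 4
```

The search reports the same loop as above, just starting from a different member: 4 waits for 10, 10 waits for 3, and 3 waits for 4, so none of the three can finish recovery on its own.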
> The problem seems to have cropped up after the fix for GEODE-7196 was merged.
> That changed the timing of member-departed notifications such that a server
> might close a Region's Persistence Advisor before getting notification that
> another server was shutting down. We used to do this notification upon
> receipt of a ShutdownMessage but now we only do it when the membership view
> has changed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)