Bruce J Schuchardt created GEODE-8385:
-----------------------------------------
Summary: hang recovering from disk with cyclic dependencies
Key: GEODE-8385
URL: https://issues.apache.org/jira/browse/GEODE-8385
Project: Geode
Issue Type: Bug
Components: membership, persistence
Reporter: Bruce J Schuchardt
Fix For: 1.13.0
In a test cluster using replicated persistent Regions, all of the servers were
shut down and restarted. The restart hung, and the logs showed a cycle in disk
store dependencies.
{noformat}
[info 2020/05/29 03:02:36.635 PDT <Thread-18> tid=0x8f] Region /Region_14 has
potentially stale data. It is waiting for another online member to recover the
latest data.My persistent id:
DiskStore ID: a175354a-d27d-4575-9916-16fd7ff7ea67 Name:
persistgemfire4_host1_4194 Location:
/10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_5_persist4_disk_1
Members with potentially new data:[
DiskStore ID: 2d77752e-507d-4425-a382-a5856c61938f Name:
persistgemfire10_host1_4208 Location:
/10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_2_persist10_disk_1]
Use the gfsh show missing-disk-stores command to see all disk stores that are
being waited on by other members.
{noformat}
After looking at the logs for all members, the "members with potentially new
data" for each member were found to be:
{noformat}
Member | Members with potentially new data
--------+----------------------------------
1 | all
2 | 4
3 | 4
4 | 10
5 | 2, 3, 4, 8, 10
6 | 2, 3, 4, 5, 7, 8, 10
7 | 3, 4, 10
8 | 3, 4, 10
9 | 2, 3, 4, 5, 7, 8, 10
10 | 3
{noformat}
It appears that there is a cycle in this "waiting for another online member"
graph: member 3 waits for 4, 4 waits for 10, and 10 waits for 3
(3 > 4 > 10 > 3).
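The cycle can be checked mechanically by treating the table as a directed "waits on" graph and running a depth-first search. The following is only an illustrative sketch, not Geode code: the names WAITS_ON and find_cycle are hypothetical, and it assumes that "all" for member 1 means every other member (2 through 10).

```python
# Wait-for graph transcribed from the table above: member -> members it waits on.
# Assumption: "all" for member 1 means every other member (2 through 10).
WAITS_ON = {
    1: [2, 3, 4, 5, 6, 7, 8, 9, 10],
    2: [4],
    3: [4],
    4: [10],
    5: [2, 3, 4, 8, 10],
    6: [2, 3, 4, 5, 7, 8, 10],
    7: [3, 4, 10],
    8: [3, 4, 10],
    9: [2, 3, 4, 5, 7, 8, 10],
    10: [3],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated last), or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge to a node on the current path: a cycle is closed.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


if __name__ == "__main__":
    cycle = find_cycle(WAITS_ON)
    print(" > ".join(str(member) for member in cycle) if cycle else "no cycle")
```

With the table's data, the search starting from member 1 reports the cycle 4 > 10 > 3 > 4, which is the same loop described above entered at a different member.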
The problem seems to have cropped up after the fix for GEODE-7196 was merged.
That fix changed the timing of member-departed notifications such that a server
might close a Region's Persistence Advisor before being notified that another
server was shutting down. We used to issue this notification upon receipt of a
ShutdownMessage, but now we issue it only when the membership view has changed.