[ 
https://issues.apache.org/jira/browse/GEODE-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce J Schuchardt updated GEODE-8385:
--------------------------------------
    Fix Version/s:     (was: 1.14.0)

> hang recovering from disk with cyclic dependencies
> --------------------------------------------------
>
>                 Key: GEODE-8385
>                 URL: https://issues.apache.org/jira/browse/GEODE-8385
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, persistence
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: no-release-note, pull-request-available
>             Fix For: 1.13.0
>
>
> In a test cluster using replicated persistent Regions, all of the servers 
> were shut down and restarted.  The restart hung, and the logs showed a cycle 
> in disk store dependencies.
>  {noformat}
> [info 2020/05/29 03:02:36.635 PDT <Thread-18> tid=0x8f] Region /Region_14 has 
> potentially stale data. It is waiting for another online member to recover 
> the latest data.My persistent id:
>   DiskStore ID: a175354a-d27d-4575-9916-16fd7ff7ea67  Name: 
> persistgemfire4_host1_4194  Location: 
> /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_5_persist4_disk_1
> Members with potentially new data:[  
> DiskStore ID: 2d77752e-507d-4425-a382-a5856c61938f  Name: 
> persistgemfire10_host1_4208  Location: 
> /10.32.110.100:/var/vcap/data/rundir/concRecoverAllV4O41/concRecoverAll-0529-024642/vm_2_persist10_disk_1]
> Use the gfsh show missing-disk-stores command to see all disk stores that are 
> being waited on by other members.
> {noformat}
> After looking at the logs for all members, the "members with potentially new 
> data" for each member were found to be:
> {noformat}
> Member | Members with potentially new data
> -------+----------------------------------
> 1      | all
> 2      | 4
> 3      | 4
> 4      | 10
> 5      | 2, 3, 4, 8, 10
> 6      | 2, 3, 4, 5, 7, 8, 10
> 7      | 3, 4, 10
> 8      | 3, 4, 10
> 9      | 2, 3, 4, 5, 7, 8, 10
> 10     | 3
> {noformat}
> It appears that there is a cycle in this "waiting for another online member" 
> graph: member 3 waits for 4, 4 waits for 10, and 10 waits for 3.
> The problem seems to have cropped up after the fix for GEODE-7196 was merged. 
> That fix changed the timing of member-departed notifications such that a 
> server might close a Region's Persistence Advisor before being notified that 
> another server was shutting down.  We used to send this notification upon 
> receipt of a ShutdownMessage, but now we send it only when the membership 
> view has changed.
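
The cycle in the table above can be checked mechanically. The following is a minimal sketch, not Geode code: `WaitCycleFinder`, `findCycle`, and `reportedGraph` are hypothetical names introduced for illustration, and the entry "all" for member 1 is assumed to mean members 2 through 10. It runs a depth-first search over the "waiting for" graph and returns the first cycle it meets, which may start at a different member than the one named in the description.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper (not part of Geode): finds a cycle in a "waiting for" graph. */
class WaitCycleFinder {

  /** Returns one cycle whose first and last elements match, or an empty list if none exists. */
  static List<Integer> findCycle(Map<Integer, List<Integer>> waitsFor) {
    Set<Integer> done = new HashSet<>(); // members whose outgoing edges are fully explored
    for (Integer start : waitsFor.keySet()) {
      List<Integer> cycle = visit(start, waitsFor, done, new ArrayList<>());
      if (!cycle.isEmpty()) {
        return cycle;
      }
    }
    return List.of();
  }

  private static List<Integer> visit(Integer member, Map<Integer, List<Integer>> waitsFor,
      Set<Integer> done, List<Integer> path) {
    int seenAt = path.indexOf(member);
    if (seenAt >= 0) {
      // The member is already on the current path, so the tail of the path is a cycle.
      List<Integer> cycle = new ArrayList<>(path.subList(seenAt, path.size()));
      cycle.add(member);
      return cycle;
    }
    if (done.contains(member)) {
      return List.of(); // already explored without finding a cycle
    }
    path.add(member);
    for (Integer next : waitsFor.getOrDefault(member, List.of())) {
      List<Integer> cycle = visit(next, waitsFor, done, path);
      if (!cycle.isEmpty()) {
        return cycle;
      }
    }
    path.remove(path.size() - 1); // backtrack: drop the last path element by index
    done.add(member);
    return List.of();
  }

  /** The table from the issue; "all" for member 1 is assumed to mean members 2..10. */
  static Map<Integer, List<Integer>> reportedGraph() {
    Map<Integer, List<Integer>> g = new TreeMap<>();
    g.put(1, List.of(2, 3, 4, 5, 6, 7, 8, 9, 10));
    g.put(2, List.of(4));
    g.put(3, List.of(4));
    g.put(4, List.of(10));
    g.put(5, List.of(2, 3, 4, 8, 10));
    g.put(6, List.of(2, 3, 4, 5, 7, 8, 10));
    g.put(7, List.of(3, 4, 10));
    g.put(8, List.of(3, 4, 10));
    g.put(9, List.of(2, 3, 4, 5, 7, 8, 10));
    g.put(10, List.of(3));
    return g;
  }

  public static void main(String[] args) {
    // Prints [4, 10, 3, 4], the same cycle as 3 > 4 > 10 > 3, entered at member 4.
    System.out.println(findCycle(reportedGraph()));
  }
}
```

Because every member on the cycle waits only for another member on the cycle, none of them can come online first, which is consistent with the hang described above.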



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
