[ 
https://issues.apache.org/jira/browse/GEODE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Smith closed GEODE-1088.
----------------------------
    Assignee: Dan Smith

> shutdown-all should skip member dependency checks when restarted
> ----------------------------------------------------------------
>
>                 Key: GEODE-1088
>                 URL: https://issues.apache.org/jira/browse/GEODE-1088
>             Project: Geode
>          Issue Type: Improvement
>          Components: management
>            Reporter: Soubhik Chakraborty
>            Assignee: Dan Smith
>
> Right now, when a member of a Geode cluster starts, it waits for other 
> members to start (for persistent regions only). The members it waits for are 
> recorded when this member is stopped, either individually or as part of 
> shutdown-all.
> Because {code}shutdown-all{code} indicates that the entire cluster is going 
> down, if incoming traffic is stopped first, all cluster members can be 
> guaranteed to be in a consistent state while the cluster is stopped. 
> Therefore, members stopped cleanly using shutdown-all can skip member 
> dependency checks while starting up.
> A more detailed proposal is in the following ticket:
> https://snappydata.atlassian.net/browse/SNAP-586
> I need the team's help (esp. [~upthewaterspout], [~bschuchardt]) to share any 
> insights into the proposal and any pitfalls they see in it. The proposed 
> sequence of steps is listed here for reference.
> There are 2 main cases we need to tackle.
> # make shutdown-all two-phase (assuming all members are healthy)
>   #* Phase-I : stop the network interfaces of all servers (via p2p messaging)
>   #* wait for in-flight operations to complete, viz.
>     #*# ongoing commits ? (note: because the network is stopped, the user 
> will already see a failure)
>     #*# restrict new commits (the network is already stopped, so new commits 
> won't arrive)
>     #*# roll back existing transactions (as no new commit/rollback will come 
> from the user)
>     #*# introduce an op counter and monitor it for zero on each member for 
> non-tx operations (can the distribution stats counter be used ?)
>     #*# invoke the disk sync procedure ?
>   #* Phase-II : trigger shutdown on each of the VMs (via p2p messaging)
>     #** right now, during shutdown-all, there is a lot of chatter at the 
> jgroups level, with members suspecting each other. should we try to avoid it ?
>   #* skip the member dependency check during restart by reading a recorded 
> entry somewhere (data dictionary ?)
> # if one or more members are unreachable (hung members), the only remaining 
> option is to shut down via script. 
>   #* Need to think more about how to recognize hung members and what should 
> be done before "kill -9", such as recording the list of those members.
>   #* these recorded members should be started last, after starting all the 
> members that shut down cleanly.
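
The sequence proposed above can be illustrated with a small, self-contained
Java sketch. It is only a toy model under stated assumptions, not Geode code:
the Member class, the in-flight op counter, the cleanShutdownRecorded marker,
and every method name are hypothetical, and the p2p messaging, transaction
rollback, and disk sync steps are left out. It shows Phase I (stop traffic,
then wait for each member's op counter to reach zero), Phase II (record a
clean-shutdown marker), and the restart rule that cleanly stopped members
start first and may skip the dependency check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the proposal; none of these names are Geode APIs. */
final class TwoPhaseShutdownSketch {

    /** Stand-in for one cluster member. */
    static final class Member {
        final String name;
        final boolean reachable;                               // false models a hung member
        final AtomicLong inFlightOps = new AtomicLong();       // the proposed op counter
        final AtomicBoolean acceptingTraffic = new AtomicBoolean(true);
        volatile boolean cleanShutdownRecorded;                // marker read at restart

        Member(String name, boolean reachable) {
            this.name = name;
            this.reachable = reachable;
        }

        /** Registers a new operation; returns false once traffic has been stopped. */
        boolean beginOp() {
            inFlightOps.incrementAndGet();
            if (!acceptingTraffic.get()) {
                inFlightOps.decrementAndGet();
                return false;
            }
            return true;
        }

        void endOp() {
            inFlightOps.decrementAndGet();
        }
    }

    /**
     * Returns true only if every member was reachable and drained in time.
     * On a drain timeout traffic stays stopped and no marker is recorded.
     */
    static boolean shutdownAll(List<Member> members, long drainTimeoutMillis) {
        for (Member m : members) {
            if (!m.reachable) {
                return false; // case 2 in the proposal: fall back to a scripted shutdown
            }
        }
        // Phase I: stop incoming traffic, then wait for each op counter to hit zero.
        for (Member m : members) {
            m.acceptingTraffic.set(false);
        }
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(drainTimeoutMillis);
        for (Member m : members) {
            while (m.inFlightOps.get() != 0) {
                if (System.nanoTime() - deadline >= 0) {
                    return false;
                }
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        // Phase II: record the clean-shutdown marker (the real shutdown is omitted).
        for (Member m : members) {
            m.cleanShutdownRecorded = true;
        }
        return true;
    }

    /** A member may skip the dependency check only if its marker was recorded. */
    static boolean canSkipDependencyCheck(Member m) {
        return m.cleanShutdownRecorded;
    }

    /** Cleanly stopped members start first; members without a marker start last. */
    static List<String> restartOrder(List<Member> members) {
        List<String> order = new ArrayList<>();
        for (Member m : members) {
            if (m.cleanShutdownRecorded) order.add(m.name);
        }
        for (Member m : members) {
            if (!m.cleanShutdownRecorded) order.add(m.name);
        }
        return order;
    }
}
```

In this model a member whose counter never drains, or that is unreachable,
makes shutdownAll return false without recording any marker, so those members
would still go through the normal dependency check on restart.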



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
