Hi David,

Ken Gaillot got me with this question:
Since corosync/pcmk can be healed from such a case, why not DLM?
Please see the detailed discussion here:
       [1] https://github.com/ClusterLabs/pacemaker/pull/839

Here are my thoughts, but I'm not sure, so please correct me if I'm wrong.
Setup: time T; cluster nodes A, B, C. Suppose we have a lockspace named after $uuid for a shared disk volume, and a CPG for lockspace $uuid. The $uuid CPG has
members A, B and C when things are OK, but:

T: quorum lost; the cluster partitions into 3 parts; lockspace $uuid cannot perform any lockspace operations because the cluster is not quorate;

T+1: quorum regained; the dlm_controld daemon CPG has not yet done its merging/fencing work. So here are 2 questions:
Q1: What is a stateful merged node?
I've seen the comments in the code ;-) Does it mean that a lockspace was already on the node before it sent the protocol message?

Q2: What if we add the stateful merged nodes to the dlm_controld daemon CPG instead of fencing them?

If so, CPG $uuid may now, e.g. from the perspective of A, have only one member: A itself. A can perform lockspace operations because the cluster is quorate again (and we have skipped fencing); B and C do likewise. Each node then behaves as if it alone owns this volume, so corruption may happen?
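
To make the worry concrete, here is a toy model in Python. It is only a sketch of the scenario above, not dlm_controld code: the function simulate() and the "views" mapping are made up for illustration, and I am assuming a node refuses an exclusive lock only when a member it can see in its own view of the $uuid CPG already holds one.

```python
# Toy model of the scenario only; this is NOT how dlm_controld is implemented.
# All names (simulate, views) are hypothetical.

def simulate(views):
    """Each node requests an exclusive lock on the shared volume.

    views maps a node name to the set of members that node currently
    sees in the lockspace CPG. A request is granted unless some member
    visible to the requester already holds the lock.
    Returns the set of nodes that end up holding the lock.
    """
    holders = set()
    for node in sorted(views):
        # The requester can only detect holders inside its own CPG view.
        if not (holders & views[node]):
            holders.add(node)
    return holders


# Healthy cluster: every node sees A, B and C in CPG $uuid.
merged = {n: {"A", "B", "C"} for n in ("A", "B", "C")}

# After the partition, with fencing skipped and no merge done:
# each node sees only itself in CPG $uuid.
split = {n: {n} for n in ("A", "B", "C")}

print(sorted(simulate(merged)))  # ['A']
print(sorted(simulate(split)))   # ['A', 'B', 'C']
```

With the merged view only one node gets the lock; with three single-member views all three nodes get it at once, which is the case I am afraid of.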

Thanks a lot,
Eric

On 05/17/2016 08:10 PM, Eric Ren wrote:
Hi David,
This is just a draft patch for you to review ;-) There's one issue I'm
not sure about: where should we clear "stateful_merge_wait"?

I also need more discussion with the pacemaker folks and more time for testing.
I will send you the formal patch once things are done ;-)

