Andy,

Thanks for writing up these notes. Some comments below.

Andrew Hisgen wrote:
> Notes from Colorado review
> Sept 18, 2008
>
> These are my notes from the openhacluster review of
> Colorado this morning Sept 18, 2008. The notes
> are skimpy/sparse.
>
> 1. Split brain and recovery discussion
>
> Ashu and others want us to consider refinements/restrictions
> to minimize the consequences of split-brain. Want us to
> consider outlawing ccr updates, with an admin action to
> re-enable them, after a suspected split-brain. Wants us
> to try leverage intentional shutdown and panic shutdown
> (with its last gasp "I am panicing" message) to not be
> in that mode.
> Issue of software that updates ccr, other than admin tools,
> was mentioned.
>

You and I discussed this with Ashu, Ellard, and Thorsten this week. I'll let Ellard summarize any changes we are going to make, since this is his area.

> 2. Mode of do nothing on suspected split brain
>
> Nils (? i did not catch his name) advocated the Veritas
> approach where services do not move after a split brain
> or suspected split brain. Of course, this is in the
> context of two node without quorum.

As per our discussions with Ashu, Ellard, and Thorsten, we have added this approach as a possibility for phase 3. Of course, all requirements for phases 2 and 3 will be reassessed after phase 1 is complete.

>
> 3. Why explicit enable of software after package install,
> can we not compute in software that package is installed
> on all cluster hosts?

See my response to Ashu's written comment on this.

>
> 4. Split-brain recovery
>
> Observe that issue of recovery when split-brain heals, ie,
> when nodes attempt to rejoin with each other pertains to
> more than ccr, it also pertains to ZFS and to AVS SNDR.
> We need to explain our recovery approach and our recovery
> procedure (even if it is manual). Nils(?) mentioned that
> the difficult recovery re-emphasizes need for a mode of
> operation where services do not move after suspected split
> brain.

Again, I'll let Ellard summarize any changes in this area.

>
> 5. Performance
>
> Performance of membership reconfiguration is likely to be
> worse with weak quorum than with traditional quorum, ie,
> longer detection time.
>
> Performance of recovery is worse in that human has to do
> manual recovery in more cases.

I've added this to the performance section. However, I couldn't really come up with any quantifiable rubric that would allow us to measure whether we meet performance requirements. Any suggestions appreciated.

Thanks,
Nick