Andy,

Thanks for writing up these notes. Some comments below.

Andrew Hisgen wrote:
> Notes from Colorado review
> Sept 18, 2008
> 
> These are my notes from the openhacluster review of
> Colorado this morning Sept 18, 2008.  The notes
> are skimpy/sparse.
> 
> 1.  Split brain and recovery discussion
> 
> Ashu and others want us to consider refinements/restrictions
> to minimize the consequences of split-brain.  Want us to
> consider outlawing ccr updates, with an admin action to
> re-enable them, after a suspected split-brain.  Also want
> us to try to leverage intentional shutdown and panic
> shutdown (with its last-gasp "I am panicking" message) to
> avoid entering that mode.
> Issue of software that updates ccr, other than admin tools,
> was mentioned.
> 

You and I discussed this with Ashu, Ellard, and Thorsten this week. I'll 
let Ellard summarize any changes we are going to make, since this is his 
area.
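
To make the proposal in item 1 concrete, here is a minimal sketch of
the behaviour as described in the notes. It is an illustration under
my own assumptions, not a design: the names PeerLoss, CcrUpdateGate,
on_peer_lost, and admin_reenable are hypothetical and do not
correspond to any existing cluster interface.

```python
from enum import Enum


class PeerLoss(Enum):
    """Why the peer node disappeared (hypothetical categories)."""
    INTENTIONAL_SHUTDOWN = "intentional_shutdown"  # peer announced a clean shutdown
    PANIC_LAST_GASP = "panic_last_gasp"            # peer sent "I am panicking" before dying
    UNEXPLAINED = "unexplained"                    # no message seen: possible split-brain


class CcrUpdateGate:
    """Tracks whether CCR updates are currently allowed on this node."""

    def __init__(self):
        self.updates_enabled = True

    def on_peer_lost(self, reason):
        # Only an unexplained loss is treated as a suspected split-brain.
        # An explained loss (shutdown or panic) leaves updates enabled.
        if reason is PeerLoss.UNEXPLAINED:
            self.updates_enabled = False

    def admin_reenable(self):
        # An explicit admin action is required to leave the restricted mode.
        self.updates_enabled = True

    def check_update_allowed(self):
        # Any updater, admin tool or other software, would call this first.
        if not self.updates_enabled:
            raise PermissionError(
                "CCR updates disabled after suspected split-brain")
```

Routing every updater through one check is one possible way to cover
the software, other than admin tools, that updates the ccr.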

> 2.  Mode of do nothing on suspected split brain
> 
> Nils (? I did not catch his name) advocated the Veritas
> approach where services do not move after a split brain
> or suspected split brain.  Of course, this is in the
> context of a two-node cluster without quorum.

As per our discussions with Ashu, Ellard, and Thorsten, we have added 
this approach as a possibility for phase 3. Of course, all requirements 
for phases 2 and 3 will be reassessed after phase 1 is complete.

> 
> 3.  Why explicit enable of software after package install,
> can we not compute in software that package is installed
> on all cluster hosts?

See my response to Ashu's written comment on this.

> 
> 4.  Split-brain recovery
> 
> Observe that the issue of recovery when split-brain heals,
> i.e., when nodes attempt to rejoin each other, pertains to
> more than ccr; it also pertains to ZFS and to AVS SNDR.
> We need to explain our recovery approach and our recovery
> procedure (even if it is manual).  Nils(?) mentioned that
> the difficult recovery re-emphasizes the need for a mode of
> operation where services do not move after a suspected
> split brain.

Again, I'll let Ellard summarize any changes in this area.

> 
> 5.  Performance
> 
> Performance of membership reconfiguration is likely to be
> worse with weak quorum than with traditional quorum, i.e.,
> detection time will be longer.
> 
> Performance of recovery is worse in that a human has to do
> manual recovery in more cases.

I've added this to the performance section. However, I couldn't come
up with a quantifiable metric that would let us measure whether we
meet the performance requirements. Any suggestions would be appreciated.

Thanks,
Nick

