> -----Original Message-----
> From: Alan Conway [mailto:[email protected]]
> Sent: Friday, December 11, 2009 1:35 PM
> To: [email protected]
> Cc: qpid-dev-apache
> Subject: Re: [c++ cluster] User doc for persistent clusters.
> 
> On 12/11/2009 03:59 PM, Sandy Pratt wrote:
> >
> >> -----Original Message-----
> >> From: Alan Conway [mailto:[email protected]]
> >> Sent: Tuesday, November 24, 2009 7:56 AM
> >> To: Jonathan Robie; qpid-dev-apache
> >> Subject: [c++ cluster] User doc for persistent clusters.
> >>
> >> I put up a user view of the persistent cluster changes, coming soon.
> >> Would appreciate any feedback on the doc or the feature it describes.
> >>
> >> http://cwiki.apache.org/confluence/display/qpid/Persistent+Cluster+Restart+Design+Note
> >
> >
> > Hi Alan,
> >
> > Looks like a great step forward for clustering.  Any hints on what's
> > involved in the manual intervention to enable restart from a full
> > cluster crash?  I'm eager to kick the tires.
> >
> 
> Basically it amounts to picking the "best" store and marking it clean
> by putting a UUID in <data_dir>/cluster/shutdown.uuid, i.e. pretend it
> was shut down cleanly.
> 

Simple enough, thanks!
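
For anyone following along, here is a minimal sketch of that manual step as I
understand it from the description above. It is not an official tool. The
function name is hypothetical, and it assumes the broker is stopped, that
shutdown.uuid holds the UUID as plain text, and that any freshly generated
UUID is acceptable; none of that is confirmed in this thread.

```python
import uuid
from pathlib import Path


def mark_store_clean(data_dir, shutdown_id=None):
    """Write a UUID to <data_dir>/cluster/shutdown.uuid.

    Hypothetical helper. Assumptions (not confirmed by the thread):
    the file is plain text containing just the UUID, and a newly
    generated UUID is good enough when none is supplied.
    """
    cluster_dir = Path(data_dir) / "cluster"
    if not cluster_dir.is_dir():
        # Refuse to create the directory: a missing cluster dir
        # probably means this is not a cluster member's data-dir.
        raise FileNotFoundError(f"no cluster directory under {data_dir}")

    if shutdown_id is None:
        shutdown_id = uuid.uuid4()

    marker = cluster_dir / "shutdown.uuid"
    marker.write_text(f"{shutdown_id}\n")
    return marker
```

It would be run once, against the data directory of whichever store was
chosen as "best", before restarting that broker.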

> I'm working on providing some help to identify the "best" store and
> ultimately hope to provide a tool for doing all this a bit more
> automatically. It will probably mean running the tool on the
> data-directory of each cluster member initially which is a pain -
> assumes remote logins or shared file systems.
> 
> I'd like to find a way to do this from one location without assuming
> shared filesystems or remote logins. There was a suggestion that if
> there's a total failure the brokers come up in "admin mode" where they
> don't accept any clients except an admin tool. The brokers would
> collect the info needed to pick the clean store and mark it clean
> driven remotely by the admin tool. Does this sound like a good
> direction, or do you have any other suggestions on how to approach
> this?

That does sound like a good approach to me.

Suppose the cluster has crashed because of a power or network failure (single 
brokers crashing due to OS or hardware problems isn't the issue here, if I 
understand correctly).

Then the common case is that they all power back up without issue, and can 
unanimously pick the correct journal while having access to all candidate 
journals (hand waving a bit here, maybe*).

The uncommon case is that they all die for whatever reason, then some member of 
the cluster fails to come back up.  At this point, full information from all 
candidate journals is not available, and a unanimous decision cannot be 
reached.  Manual intervention sounds fine here.

*I noticed in some of the JIRA notes that the message store changes are labeled 
with a monotonically increasing sequence number.  Is this derived from the 
Lamport clock implemented by openais (in which case all the events are 
conveniently serialized by the CPG)?  I could be misunderstanding the way the
CPG works, but if I have it right, it sounds like an excellent way to get the
cluster back in sync.
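
To make the common/uncommon split above concrete, here is a rough sketch of
the decision rule I have in mind. It assumes each member can report the last
sequence number in its journal and that the highest number identifies the
"best" store; both are my assumptions, not something the design note states.
The function and member names are hypothetical.

```python
from typing import Dict, Optional, Set


def pick_best_store(expected: Set[str], reported: Dict[str, int]) -> Optional[str]:
    """Return the member holding the store with the highest sequence number.

    Returns None when any expected member has not reported, which is the
    case where manual intervention would be needed.
    """
    if not expected or expected - reported.keys():
        # Empty cluster or incomplete information: no unanimous decision.
        return None

    # Only consider expected members; iterate in sorted order so that a
    # tie on sequence number is broken by member name, deterministically.
    return max(sorted(expected), key=lambda member: reported[member])
```

With members "a", "b" and "c" reporting 41, 57 and 57, this picks "b". If "c"
never comes back, it returns None and the decision falls to an admin.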

Sandy


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
