Re: [Openais] correlating events

David Teigland Thu, 10 Sep 2009 12:34:40 -0700

On Wed, Sep 02, 2009 at 09:23:08AM -0500, David Teigland wrote:
> > > 1. correlating events from different services locally
> > > 
> > > I get nodedown from both cman (or quorum service) and cpg.  I need to
> > > correlate them with each other.  When I get a cpg nodedown for node A, I
> > > don't know which cman nodedown for A it refers to: one of multiple in the
> > > past or one in the future that cman hasn't reported yet.
> > > 
> > 
> > Correlation could be solved by addition of api to cman, cpg, and quorum
> > to retrieve the globally unique ring id for the last configuration
> > change delivered to the application.
> > 
> > If you agree, we can work on the implementation for corosync 1.1.
> > Adding this to CPG is trivial, not sure about other services.
> > 
> > Our policies wrt x.y.z would not be violated with this change.
> > 
> > As an example, the API for cpg might look like
> > 
> > cpg_ringid_get (handle, &ring_id);
> > 
> > Then ring_id could be memcmp'ed in the application.
> > 
> > This would retrieve the last ring id delivered to the application (not
> > the current ring id known to the cpg service).


Thinking more about this, I think there are two different kinds of ringid
queries that we'd want from cpg.  That's because every new ringid results in a
cman/quorum confchg, but not every ringid change results in a cpg confchg.

My understanding is that ringid (actually ringid sequence number) is
incremented for each new ring (each cluster membership change).

a. For a given ringid from cpg for a nodedown confchg, need to know that
   cman/quorum has seen the same nodedown.

Comparing the ringid of the cpg nodedown confchg and the ringid from cman
should work for this.  If cman ringid is greater than or equal to the ringid
of the cpg nodedown confchg, then we know cman is aware of the cpg nodedown.
cman ringid may be larger if another node has since joined the cluster but not
the cpg, or if a cluster member failed that was not a member of the cpg.
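A rough sketch of the check in (a), assuming the application has already
pulled the two ringid sequence numbers out as plain 64-bit integers.  The
function name and parameters are made up for illustration; they are not part
of libcman or libcpg.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a libcman/libcpg call.
 *
 * cman_ring_seq:        ringid sequence number last reported by cman/quorum
 * cpg_confchg_ring_seq: ringid sequence number attached to the cpg nodedown
 *                       confchg (assumes cpg can supply it)
 *
 * Returns true when cman is already aware of the ring change that caused
 * the cpg nodedown.
 */
static bool cman_has_seen_cpg_nodedown(uint64_t cman_ring_seq,
                                       uint64_t cpg_confchg_ring_seq)
{
    /* ">=" rather than "==": cman may have moved on because of a later
     * ring change that did not affect this cpg. */
    return cman_ring_seq >= cpg_confchg_ring_seq;
}
```

With the numbers from the example further down, a cpg nodedown at ringid 44
is covered once cman reports 44 or 48, but not while cman still reports 40.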

b. For a given ringid from cman/quorum, need to know that any confchgs up to
   that same ringid have been delivered to the cpg.

These imply two different ringid values for cpg:

1. the ringid of the last confchg delivered to the cpg
2. the ringid that cpg deliveries are up to date with, which may be greater
   than the ringid of the last confchg delivered if the latest ring changes
   have not altered the cpg membership
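To make the two values concrete, here is a minimal sketch of how an
application might hold them and answer question (b).  The struct, its field
names and the helper are assumptions for illustration only; no such type
exists in libcpg today.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of ringid sequence numbers for one cpg. */
struct cpg_ringids {
    uint64_t last_confchg; /* 1. ringid of the last confchg delivered */
    uint64_t up_to_date;   /* 2. ringid that cpg deliveries have reached */
};

/*
 * Case (b): given a ringid from cman/quorum, have all cpg confchgs up to
 * that ringid been delivered?  Only the second value can answer this,
 * because the first one stays behind when a ring change does not alter
 * the cpg membership.
 */
static bool cpg_caught_up_with(const struct cpg_ringids *ids,
                               uint64_t cman_ring_seq)
{
    return ids->up_to_date >= cman_ring_seq;
}
```

The invariant implied by the definitions is last_confchg <= up_to_date;
the two are equal right after a confchg has been dispatched.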

example

cluster ringid = 40
cluster members = 1,2,3,4,5
cpg members = 1,2,3,4

node 1 fails
cluster ringid = 44
cluster members = 2,3,4,5
cpg members = 2,3,4

cman_ringid(h, &id)
  id = 44

cpg_ringid(h, &id1, &id2)
  before the app dispatches the cpg confchg callback
  id1 = 40, id2 = 40
  after the app dispatches the cpg confchg callback
  id1 = 44, id2 = 44

node 5 fails
cluster ringid = 48
cluster members = 2,3,4
cpg members = 2,3,4

cman_ringid(h, &id)
  id = 48

cpg_ringid(h, &id1, &id2)
  id1 = 44, id2 = 48
  (there are no confchgs for the cpg in response to 5 failing)

> Turns out that libcman already has a call that returns the ring id, so all I
> need now is the addition to cpg.

Chrissie pointed out that libcman only returns the 64-bit ringid as a uint32,
but I doubt we'll see ringids bigger than that.  Even if we do, I'm just
comparing consecutive ids, so the lower 32 bits should be fine.
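For what it's worth, one way to compare truncated ids that stays correct
across a 32-bit wrap is a serial-number style comparison.  This is only a
sketch of that general technique, not something libcman provides, and the
helper names are invented here.  It assumes the two ids being compared are
less than 2^31 apart, which holds for consecutive ring changes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Keep only the lower 32 bits of a 64-bit ringid sequence number,
 * mirroring what a uint32 return value would carry. */
static uint32_t ring_seq_low32(uint64_t ring_seq)
{
    return (uint32_t)ring_seq;
}

/*
 * Hypothetical helper: true if truncated id a is at or after truncated
 * id b.  The subtraction is done modulo 2^32, so the result is still
 * right when a has wrapped past zero and b has not, as long as the two
 * ids are less than 2^31 apart.
 */
static bool ring_seq32_at_least(uint32_t a, uint32_t b)
{
    return (uint32_t)(a - b) < UINT32_C(0x80000000);
}
```

A plain "a >= b" on the truncated values gives the same answer everywhere
except right at the wrap point.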

Dave

_______________________________________________
Openais mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/openais
