On May 29, 2008, at 12:01 PM, Joe Bill wrote:
I would have thought it to be HB's job
to call those level 0 checks after a
resource state changing operation.
... So there are a number of reasons why
it's a good idea for RAs to do this.
I understand.
It's just that following what Alan wrote a year ago:
http://www.gossamer-threads.com/lists/linuxha/users/38913#38913
What is the state of the world?
Is the state of the world a problem?
... I was left with the impression that HB wanted a
"hands-off" approach, with its own strategy for
handling ambiguous "runs-unhealthy" states through
multiple "STOP" and "START" calls, checked by
"MONITOR" calls in between. You, on the other hand,
seem to be telling me that the RA has to do whatever
it takes to hand over a "STARTED" or "STOPPED" state
to HB, including "shaping" the resource to match the
state HB thinks it is in ("STARTED" or "STOPPED",
depending on HB's last RA call), to prevent
HB from counting a "failed" state.
No. You're reading more into it than was intended.
The context of this was that Peter was trying to out-think the cluster
and tell us what he thought we wanted to hear.
At the time I warned him this was dangerous and was proved correct
about a year later when the exact scenario I described happened.
We just want the RA to tell the truth.
If the RA says "yes, we started the service", then for your own sake,
that really should be true.
Doubly so for stop.
Likewise if a problem occurred, the RA should tell the cluster.
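To illustrate what "telling the truth" on start could look like, here is a minimal sketch in Python. It is not taken from any real resource agent: `launch` and `monitor` are hypothetical callables standing in for whatever the agent does to start and probe its service, and the timeout values are arbitrary. The only point is that success is reported after a monitor check confirms it, and an error is reported otherwise.

```python
import time

# OCF return codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def start(launch, monitor, timeout=5.0, interval=0.1, sleep=time.sleep):
    """Start a service and report success only once monitor confirms it.

    launch and monitor are hypothetical callables supplied by the agent:
    launch() kicks off the service, monitor() returns an OCF return code.
    """
    if monitor() == OCF_SUCCESS:
        return OCF_SUCCESS  # already running, nothing to do
    launch()
    waited = 0.0
    while waited <= timeout:
        if monitor() == OCF_SUCCESS:
            return OCF_SUCCESS  # verified, so the claim is true
        sleep(interval)
        waited += interval
    # Could not confirm the service is up: tell the cluster about it.
    return OCF_ERR_GENERIC
```

The `sleep` parameter exists only so the polling loop can be exercised without real delays.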
So a resource is either started, stopped or "failed".
There are multiple types of failure, and some of them are handled
differently.
Even a "started" resource can count as failed though - for
example if a service is active on more than one machine.
This is naturally also something we handle.
May I kindly ask you to explain the following:
1) Assuming a resource is completely stopped, what is
HB's next RA call (or sequence of calls if HB has more
than one call planned) when a START returns:
a) status=1
b) status=7
stop and possibly restart, almost certainly on another machine.
c) status=2
IIRC, just stop.
Does HB attempt a STOP&START sequence of the resource,
undertaking to perform a STOP on an already stopped
resource ?
We mostly try to avoid stopping something that's already stopped.
But there are times when we need to.
The LSB and OCF specs specifically require the RA to support this.
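As a rough illustration of that requirement, the sketch below shows a stop action that succeeds when the resource is already stopped. It is written in Python for brevity and is only an assumption about how an agent might be structured: `shutdown` and `monitor` are hypothetical callables, not part of any real agent.

```python
# OCF return codes used in this sketch.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def stop(shutdown, monitor):
    """Stop a service; stopping an already stopped service is a success.

    shutdown and monitor are hypothetical callables supplied by the agent.
    """
    if monitor() == OCF_NOT_RUNNING:
        return OCF_SUCCESS  # already stopped: still a successful stop
    shutdown()
    if monitor() == OCF_NOT_RUNNING:
        return OCF_SUCCESS  # confirmed stopped
    # Still running (or state unknown): report the failure honestly.
    return OCF_ERR_GENERIC
```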
2) Assuming a resource was previously successfully
started, what is HB's strategy and next RA call (or
sequence of calls if HB has more than one call
planned) when a routine MONITOR returns:
a) status=1
b) status=7
stop and possible restart
c) status=2
Does HB attempt a STOP&START strategy of the resource?
as above - I think just stop
Or does HB count a "failed" state for that resource
and initiate a failover?
would be a pretty pointless cluster manager if it didn't
Note that monitor operations are optional.
If they're not defined, then they're not run.
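For reference, assuming the numeric statuses in questions 1 and 2 are OCF return codes (the thread does not say so explicitly), they can be spelled out by name. The enum below is just a lookup aid; it does not encode any recovery policy.

```python
from enum import IntEnum


class OcfRc(IntEnum):
    """Return codes defined for OCF resource agents."""
    OCF_SUCCESS = 0
    OCF_ERR_GENERIC = 1        # status=1 in the questions above
    OCF_ERR_ARGS = 2           # status=2 in the questions above
    OCF_ERR_UNIMPLEMENTED = 3
    OCF_ERR_PERM = 4
    OCF_ERR_INSTALLED = 5
    OCF_ERR_CONFIGURED = 6
    OCF_NOT_RUNNING = 7        # status=7 in the questions above


def describe(status):
    """Return the symbolic name for a numeric status, or 'unknown'."""
    try:
        return OcfRc(status).name
    except ValueError:
        return "unknown"
```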
So HB never attempts to capture a snapshot of the
state of all resources before undertaking a state
changing operation, restricting itself to ...
- relying on the state HB thinks the resources are in
after having started/stopped them (the last STOP or
START performed by HB on that resource),
you left out "and the most recent monitor calls"
we get all this data from each node's lrm and based on the order in
which operations occurred, and their results, we form a view of what
state the resource is in across the whole cluster
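The idea of forming a view from ordered operation results can be sketched as a toy model. This is not Heartbeat's actual algorithm: the state names, the assumption that a resource starts out stopped, and the rule that any failed node makes the whole resource "failed" are all simplifications made up for illustration.

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


def node_state(history):
    """Replay (op, rc) pairs, oldest first, for one node."""
    state = "stopped"  # assumption: nothing is running before any operation
    for op, rc in history:
        if op == "start":
            state = "started" if rc == OCF_SUCCESS else "failed"
        elif op == "stop":
            state = "stopped" if rc == OCF_SUCCESS else "failed"
        elif op == "monitor":
            if rc == OCF_SUCCESS:
                state = "started"
            elif rc == OCF_NOT_RUNNING:
                # Not running where it was expected to run is a failure.
                state = "failed" if state == "started" else "stopped"
            else:
                state = "failed"
    return state


def cluster_state(histories):
    """Combine per-node histories ({node: [(op, rc), ...]}) into one view."""
    per_node = {node: node_state(h) for node, h in histories.items()}
    active = sorted(n for n, s in per_node.items() if s == "started")
    if len(active) > 1:
        return "failed", active  # active on more than one machine
    if "failed" in per_node.values():
        return "failed", active
    return ("started" if active else "stopped"), active
```

For example, a successful start on one node while a second node also reports the service as running yields `("failed", [...])` with both nodes listed, matching the "active on more than one machine" case mentioned earlier.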
AND,
- reacting upon an unexpected status returned by a
scheduled MONITOR call
Is this correct ?
yes
also node-level failures
Again, thank you very much for your time reading and
replying to my concerns.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems