- **Type**: defect --> enhancement
---
**[tickets:#398] AMF internal HA issues**
**Status:** unassigned
**Milestone:** future
**Created:** Fri May 31, 2013 05:21 AM UTC by Nagendra Kumar
**Last Updated:** Tue Feb 10, 2015 06:45 AM UTC
**Owner:** nobody
Migrated from http://devel.opensaf.org/ticket/2636
For example:
amfd sends a REGSU(DN) message to amfnd
amfnd fails and responds with not OK
Which results in:
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_ndproc.c:159: 2
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_data_update_req_evh: invalid node state 3
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_msg_sanity_chk: invalid msg id 53, from 2020f should be 52
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_msg_sanity_chk: invalid msg id 54, from 2020f should be 52
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_msg_sanity_chk: invalid msg id 55, from 2020f should be 52
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_msg_sanity_chk: invalid msg id 56, from 2020f should be 52
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_msg_sanity_chk: invalid msg id 57, from 2020f should be 52
Apr 25 14:36:05 debian1 osafamfd[1377]: avd_msg_sanity_chk: invalid msg id 58,
a node in limbo condition!
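The repeated "should be 52" lines are what a strict in-sequence check would produce once one message has been rejected: the expected id never advances, so every later message fails too. The following is only a minimal sketch of that behaviour, assuming the check compares against "last accepted id + 1". `struct node_rec`, `rcv_msg_id` and `msg_sanity_chk` are hypothetical names for illustration, not the actual amfd code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-node record; not the real AVD_AVND structure. */
struct node_rec {
    uint32_t node_id;    /* e.g. 0x2020f */
    uint32_t rcv_msg_id; /* id of the last message accepted from this node */
};

/* Accept a message only if its id is exactly the next one in sequence.
 * Assumption: a rejected message does not advance the expected id. */
static bool msg_sanity_chk(struct node_rec *node, uint32_t msg_id)
{
    assert(node != NULL);

    uint32_t expected = node->rcv_msg_id + 1;

    if (msg_id != expected) {
        fprintf(stderr, "invalid msg id %u, from %x should be %u\n",
                (unsigned)msg_id, (unsigned)node->node_id,
                (unsigned)expected);
        return false; /* rejected: rcv_msg_id stays unchanged */
    }

    node->rcv_msg_id = msg_id;
    return true;
}
```

With `rcv_msg_id` at 51, ids 53 through 58 are all rejected against the same expected value 52, which leaves the node stuck until id 52 is accepted or the node is fenced.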
Check out the code in avd_ndproc.c that runs when this is discovered:
void avd_node_down_func(AVD_CL_CB *cb, AVD_AVND *avnd)
{
    TRACE_ENTER();
    /*TODO*/
    // opensaf_reboot(avnd->node_info.nodeId,
    //                avnd->node_info.executionEnvironment.value,
    //                "Making the node down");
    if (avnd->node_state != AVD_AVND_STATE_SHUTTING_DOWN) {
        avd_node_state_set(avnd, AVD_AVND_STATE_GO_DOWN);
        m_AVSV_SEND_CKPT_UPDT_ASYNC_UPDT(cb, avnd, AVSV_CKPT_AVND_NODE_STATE);
    }
}
OpenSAF 3.0 tries to reset this node using avm and HPI.
So we could instead do the following:
send a REBOOT msg AND call opensaf_reboot(), that is, both soft and hard fencing,
and/or have amfnd never send a non-success response but instead
commit suicide.
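As a rough illustration of the "both soft and hard fencing" idea, the sketch below sends the reboot request first and then issues the hard reboot regardless of whether the node acknowledged it. All names here (`struct node`, `send_reboot_msg`, `hard_reboot`, `node_down`) are hypothetical stand-ins that only count calls; they are not the real amfd functions or the real opensaf_reboot() signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified node states. */
enum node_state { NODE_PRESENT, NODE_SHUTTING_DOWN, NODE_GO_DOWN };

struct node {
    uint32_t node_id;
    enum node_state state;
};

/* Call counters, for demonstration only. */
static int soft_requests;
static int hard_requests;

/* Stand-in for sending a REBOOT message to the node (soft fencing).
 * Returns false to model a node that does not acknowledge. */
static bool send_reboot_msg(uint32_t node_id)
{
    (void)node_id;
    soft_requests++;
    return false;
}

/* Stand-in for the hard reboot path (hard fencing). */
static void hard_reboot(uint32_t node_id, const char *reason)
{
    (void)node_id;
    (void)reason;
    hard_requests++;
}

static void node_down(struct node *n)
{
    assert(n != NULL);

    /* Soft fencing is best effort: a misbehaving node may ignore it. */
    (void)send_reboot_msg(n->node_id);

    /* Hard fencing does not depend on the soft request succeeding. */
    hard_reboot(n->node_id, "Making the node down");

    if (n->state != NODE_SHUTTING_DOWN)
        n->state = NODE_GO_DOWN;
}
```

Issuing the hard reboot unconditionally is the design choice that avoids the limbo state: the director never waits on a node it already distrusts.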
Comments please!
Changed 13 months ago by mathi
This is something similar to the Byzantine Generals Problem.
The slave is misbehaving
a) either because of a bug within itself (a fault local to the
node) or
b) because of a network or h/w issue.
In both cases the node has to be 'evicted' (in SAF terminology).
The eviction could be done by shooting the node.
From the slave's perspective:
If the reason is a), i.e. if it can detect (through any mechanism) that the
fault is localised, then the slave commits suicide.
If not, it shall continue, knowing that the master will shoot it.
From the master's perspective:
Always shoot a misbehaving node (after expiry of any tolerance and/or any such
configuration policy).
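The two perspectives above can be sketched as a pair of small decision functions. This is only an illustration under stated assumptions: the tolerance is modelled as a simple count of misbehaviour events, and every name (`slave_decide`, `master_on_misbehaviour`, `struct misbehaviour`) is hypothetical rather than part of OpenSAF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum slave_action { SLAVE_CONTINUE, SLAVE_SELF_TERMINATE };
enum master_action { MASTER_TOLERATE, MASTER_EVICT };

/* Slave side: self-terminate only when the fault is known to be local;
 * otherwise keep running and leave eviction to the master. */
static enum slave_action slave_decide(bool fault_is_local)
{
    return fault_is_local ? SLAVE_SELF_TERMINATE : SLAVE_CONTINUE;
}

/* Master side: per-node misbehaviour bookkeeping.
 * Assumption: tolerance is a count of events the master will put up with. */
struct misbehaviour {
    unsigned count;     /* events seen so far */
    unsigned tolerance; /* events tolerated before eviction */
};

/* Record one misbehaviour event and decide whether to evict the node. */
static enum master_action master_on_misbehaviour(struct misbehaviour *m)
{
    assert(m != NULL);

    m->count++;
    return (m->count > m->tolerance) ? MASTER_EVICT : MASTER_TOLERATE;
}
```

A tolerance of 0 gives the strictest policy, where the first misbehaviour event already leads to eviction.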
My take (in the context of our framework) is:
1) If the bug here is because of some message dropping by the receiver (AMFND)
and/or misbehaving logic, we first fix it.
2) If we cannot determine that the reason is 1), then we commit
suicide.
3) In either case AMFD simply has to call opensaf_reboot().
The opensaf_reboot script is the placeholder for the 'middleware integrator' to
replace the PLM commands or OS commands
with the commands of their own hardware management interface.
For example, if the integrator (i.e. the hardware) has IPMC support, then replace any
OS/PLM command with
the following:
$ impc <reboot> <node_id>
Changed 13 months ago by mathi
To prevent multiple reboots, if AMFND could respond back to AMFD with its
status, then the reboot could be initiated by AMFD. In this case, AMFD
might as well choose to re-transmit if it's a mismatch of send/recv-id, but we
don't have re-transmit support (or it may just not be necessary)... so self-kill
seems to be a better option...
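One way to keep the director from issuing several reboots for the same node is a per-node "already requested" flag. The sketch below shows that guard only; `struct node_fence`, `issue_reboot` and `reboot_once` are hypothetical names, and `issue_reboot` is a counting stand-in rather than the real reboot path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-node fencing record kept by the director. */
struct node_fence {
    uint32_t node_id;
    bool reboot_requested; /* set once a reboot has been initiated */
};

/* Call counter, for demonstration only. */
static int reboots_issued;

/* Stand-in for whatever actually reboots the node. */
static void issue_reboot(uint32_t node_id)
{
    (void)node_id;
    reboots_issued++;
}

/* Initiate a reboot at most once per node.
 * Returns true if this call initiated it, false if one was already pending. */
static bool reboot_once(struct node_fence *n)
{
    assert(n != NULL);

    if (n->reboot_requested)
        return false;

    n->reboot_requested = true;
    issue_reboot(n->node_id);
    return true;
}
```

The flag would need to be cleared when the node rejoins, which this sketch leaves out.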