Hans N and I made some progress on this one. Analysis:
Test case: cluster restart
* Because of bug #540, the old active controller "hangs" during shutdown. At
this point the RDE process has probably already been killed.
* The old standby controller is rebooted and comes up. RDE detects no peer and
assumes the ACTIVE role.
* When the above happens MDS informs the old active amfd process that it is now
QUIESCED. From the MDS documentation:
"MDS_CALLBACK_QUIESCED_ACK
This callback informs the MDS Client that it is being moved to standby
HA-state. The callback may be due to the result of the MDS Client switching to
quiesced HA-state or due to a competing MDS-Client showing up in active
HA-state."
The last part is what happens here.
* This fools the old active amfd into thinking a controller switch-over is in
progress, so it calls saImmOiImplementerClear(). This call eventually times
out and abort() is called. The call fails because the local immnd process
(and immd) has already been killed.
This is basically a controller split brain. Ideally, RDE should fence the peer
controller before going active. At the very least it could gather more evidence
before making that decision. There are already artifacts for RDE.
The proposed small change in amfd is to check, in the avd_mds_qsd_role_evh()
event handler, that a switch-over is pending. If not, just log and exit. amfnd
will detect that and reboot the node, and no core dump is generated.
This should even be easy to reproduce.
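A rough sketch of what that check could look like follows. This is not the
actual amfd code: the type amfd_cb_t, the swap_pending flag, and the function
decide_qsd_action() are hypothetical names used only to illustrate the
decision. The real control block and handler in avd_role.c differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the amfd control block; the real one differs. */
typedef struct {
    bool swap_pending; /* assumed flag: true while an ordered switch-over runs */
} amfd_cb_t;

typedef enum {
    QSD_CONTINUE_SWITCHOVER, /* expected ack: carry on with the switch-over */
    QSD_LOG_AND_EXIT         /* unexpected ack: log and exit, amfnd reboots */
} qsd_action_t;

/* Decide how to react when MDS reports that we have been moved to QUIESCED. */
qsd_action_t decide_qsd_action(const amfd_cb_t *cb)
{
    assert(cb != NULL);
    /* Only touch IMM if we asked for the role change ourselves. */
    return cb->swap_pending ? QSD_CONTINUE_SWITCHOVER : QSD_LOG_AND_EXIT;
}
```

In the split-brain case described above no switch-over was ordered, so the
handler would take the QSD_LOG_AND_EXIT branch and never reach the IMM call.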
---
** [tickets:#516] Amfd: calling immutil_saImmOiImplementerClear in
avd_mds_qsd_role_evh leads to amfnd sending SIGABRT to amfd**
**Status:** assigned
**Created:** Tue Jul 23, 2013 02:39 PM UTC by hano
**Last Updated:** Wed Aug 14, 2013 11:42 AM UTC
**Owner:** Praveen
osafamfd is "supervised" by osafamfnd: osafamfd sends "heartbeats" to
osafamfnd. If no "heartbeats" are received within one minute, osafamfnd
sends an abort signal to osafamfd, which then aborts (produces a core dump
and exits). In the case below, osafamfd stops sending "heartbeats" because it
has received a role change message from MDS (Active to Quiesced) and calls
immutil_saImmOiImplementerClear. IMM does not respond, so osafamfd waits
without sending any "heartbeats" and is aborted by osafamfnd.
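The supervision rule can be illustrated with a small sketch. It is only a
model of the timeout decision, not the osafamfnd implementation: hb_monitor_t,
hb_received() and hb_expired() are hypothetical names, and treating exactly
60 seconds of silence as expired is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HB_TIMEOUT_SEC 60 /* "one minute", as stated in the ticket */

/* Hypothetical supervisor state: time of the last heartbeat, in seconds. */
typedef struct {
    uint64_t last_hb_sec;
} hb_monitor_t;

/* Record a heartbeat that arrived at now_sec. */
void hb_received(hb_monitor_t *m, uint64_t now_sec)
{
    assert(m != NULL);
    m->last_hb_sec = now_sec;
}

/* True when the supervised process has been silent for the full timeout.
 * The supervisor would then send SIGABRT to it. */
bool hb_expired(const hb_monitor_t *m, uint64_t now_sec)
{
    assert(m != NULL);
    return now_sec - m->last_hb_sec >= HB_TIMEOUT_SEC;
}
```

Any call that blocks the supervised process for longer than HB_TIMEOUT_SEC
makes hb_expired() return true, which is what the backtrace below shows.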
There are several cases with this behavior. amfd should not call
immutil_saImmOiImplementerClear; instead it should call saImmOiImplementerClear
directly and handle the return code and retry logic in the avd_main_proc poll
loop, to avoid these core dumps and keep amf responsive.
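One possible shape for such retry logic is sketched below. It is an
illustration under assumptions, not the amfd implementation: rc_t stands in
for SaAisErrorT, clear_fn is a hypothetical wrapper around a single
saImmOiImplementerClear attempt, and fake_clear() is a test double. The idea
is that each pass of the poll loop makes at most one attempt and returns
immediately, so heartbeats can still be sent between attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes; the real call returns SaAisErrorT. */
typedef enum { RC_OK, RC_TRY_AGAIN, RC_FAILED } rc_t;

/* Hypothetical wrapper type: one non-blocking attempt to clear. */
typedef rc_t (*clear_fn)(void *ctx);

typedef struct {
    bool pending;          /* a clear is still outstanding */
    unsigned attempts;     /* attempts made so far */
    unsigned max_attempts; /* give up after this many attempts */
} clear_retry_t;

typedef enum { CLEAR_DONE, CLEAR_RETRY_LATER, CLEAR_GIVE_UP } clear_status_t;

/* Arm the retry state when the role change is first handled. */
void clear_retry_start(clear_retry_t *r, unsigned max_attempts)
{
    assert(r != NULL);
    r->pending = true;
    r->attempts = 0;
    r->max_attempts = max_attempts;
}

/* Called once per poll-loop iteration; never sleeps. */
clear_status_t clear_retry_step(clear_retry_t *r, clear_fn fn, void *ctx)
{
    assert(r != NULL && fn != NULL);
    if (!r->pending)
        return CLEAR_DONE;
    r->attempts++;
    rc_t rc = fn(ctx);
    if (rc == RC_OK) {
        r->pending = false;
        return CLEAR_DONE;
    }
    if (rc == RC_TRY_AGAIN && r->attempts < r->max_attempts)
        return CLEAR_RETRY_LATER; /* poll loop retries on its next timeout */
    r->pending = false;
    return CLEAR_GIVE_UP; /* hard failure or retry budget exhausted */
}

/* Test double: returns RC_TRY_AGAIN until *ctx failures are used up. */
rc_t fake_clear(void *ctx)
{
    unsigned *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return RC_TRY_AGAIN;
    }
    return RC_OK;
}
```

When clear_retry_step() returns CLEAR_RETRY_LATER, the poll loop would use a
finite poll timeout instead of -1 so that the next attempt is scheduled.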
---
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal 6, Aborted.
#0 0x00007f08e45b6dfd in nanosleep () from /lib64/libc.so.6
(gdb) bt full
#0 0x00007f08e45b6dfd in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f08e45e2824 in usleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x0000000000407506 in immutil_saImmOiImplementerClear
(immOiHandle=94489411855) at ../../../../../osaf/tools/safimm/src/immutil.c:1042
rc = <optimized out>
nTries = 54
#3 0x000000000043492a in avd_mds_qsd_role_evh (cb=0x69c980, evt=<optimized out>) at avd_role.c:573
status = <optimized out>
rc = <optimized out>
__FUNCTION__ = <error reading variable>
#4 0x000000000043341d in avd_process_event (cb_now=0x69c980, evt=0x7ff160) at avd_proc.c:591
__FUNCTION__ = <error reading variable>
#5 0x00000000004336a1 in avd_main_proc () at avd_proc.c:507
pollretval = <optimized out>
cb = 0x69c980
evt = 0x7ff160
mbx_fd = <optimized out>
error = <optimized out>
polltmo = -1
#6 0x00000000004096bd in main (argc=<optimized out>, argv=<optimized out>) at amfd_main.c:47
error = 0
node_id = <optimized out>
(gdb) quit