- **status**: unassigned --> review
- **assigned_to**: Mathi Naickan
- **Milestone**: future --> 4.4.RC1



---

**[tickets:#727] clmna core dumped on payload when the cluster is going down**

**Status:** review
**Created:** Fri Jan 17, 2014 07:47 AM UTC by Sirisha Alla
**Last Updated:** Wed Jan 29, 2014 07:18 PM UTC
**Owner:** Mathi Naickan

The issue is seen on a 4-node SLES VM setup with changeset 4733 and the 
patches corresponding to #220.

There appears to have been a TIPC link flap(??) that led to a reset of the 
cluster. While the payload PL-3 was going down, a CLMNA core dump was observed.

Syslog of SC-1:

Jan 16 11:54:24 SLES-64BIT-SLOT1 osafimmd[2555]: NO 2PBE configured with 
IMMSV_PEER_SC_MAX_WAIT: 30 seconds
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: Started
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: NO Persistent Back-End 
capability configured, Pbe file:imm.db (suffix may get added)
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmd[2555]: NO 2PBE wait. Passed time:3698 
new timeout: 26302 msecs
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: NO SERVER STATE: 
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmd[2555]: NO 2PBE wait. Passed time:3803 
new timeout: 26197 msecs
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: NO 2PBE configured, 
IMMSV_PBE_FILE_SUFFIX:.2010f (sync)
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: NO SERVER STATE: 
IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: NO SERVER STATE: 
IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Jan 16 11:54:28 SLES-64BIT-SLOT1 osafimmnd[2565]: NO NODE STATE-> 
IMM_NODE_ISOLATED
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmd[2555]: NO 2PBE wait. Passed time:4110 
new timeout: 25890 msecs
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmd[2555]: NO 2PBE wait. Passed time:4112 
new timeout: 25888 msecs
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmnd[2565]: NO Sync client discarded 
classimplementer set. Impl-id:1 Class:SaLogStreamConfig
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmnd[2565]: NO Sync client discarded 
classimplementer set. Impl-id:1 Class:OpenSafLogConfig
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmd[2555]: NO SBY: Ruling epoch noted 
as:59
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmd[2555]: NO IMMND coord at 2020f
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmd[2555]: NO SBY: 
SaImmRepositoryInitModeT changed and noted as 'SA_IMM_KEEP_REPOSITORY'
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmnd[2565]: NO NODE STATE-> 
IMM_NODE_W_AVAILABLE
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmnd[2565]: NO Implementer connected: 3 
(safClmService) <0, 2020f>
Jan 16 11:54:29 SLES-64BIT-SLOT1 osafimmnd[2565]: NO SERVER STATE: 
IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Resetting link 
<1.1.1:eth0-1.1.4:eth0>, requested by peer
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Lost link 
<1.1.1:eth0-1.1.4:eth0> on network plane A
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Lost contact with 
<1.1.4>
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Resetting link 
<1.1.1:eth0-1.1.3:eth0>, requested by peer
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Lost link 
<1.1.1:eth0-1.1.3:eth0> on network plane A
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Lost contact with 
<1.1.3>
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Resetting link 
<1.1.1:eth0-1.1.2:eth0>, requested by peer
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Lost link 
<1.1.1:eth0-1.1.2:eth0> on network plane A
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Lost contact with 
<1.1.2>
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Established link 
<1.1.1:eth0-1.1.4:eth0> on network plane A
Jan 16 11:54:30 SLES-64BIT-SLOT1 kernel: [   63.163540] TIPC: Established link 
<1.1.1:eth0-1.1.2:eth0> on network plane A
Jan 16 11:54:30 SLES-64BIT-SLOT1 osaffmd[2545]: NO Role: STANDBY, Node Down for 
node id: 2020f
Jan 16 11:54:30 SLES-64BIT-SLOT1 osaffmd[2545]: Rebooting OpenSAF NodeId = 0 EE 
Name = No EE Mapped, Reason: Failover occurred, but this node is not yet ready, 
OwnNodeId = 131343, SupervisionTime = 60

Syslog of SC-2:

Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmd[2343]: NO New IMMND process is on 
STANDBY Controller at 2010f
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmd[2343]: NO Extended intro from node 
2010f
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmpbed: IN arg[0] == 
'/usr/lib64/opensaf/osafimmpbed'
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmpbed: IN arg[1] == '--pbe2A'
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmpbed: IN arg[2] == 
'/home/sirisha/immsv/immpbe/imm.db.2020f'
Jan 16 11:54:20 SLES-64BIT-SLOT2 osaflogd[2385]: Started
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmpbed: IN Generating DB file from 
current IMM state. DB file: /home/sirisha/immsv/immpbe/imm.db.2020f
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmpbed: NO Successfully opened empty 
local sqlite pbe file /tmp/imm.db.O23tIO
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmd[2343]: WA IMMND on controller (not 
currently coord) requests sync
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmd[2343]: NO Node 2010f request sync 
sync-pid:2565 epoch:0
Jan 16 11:54:20 SLES-64BIT-SLOT2 osaflogd[2385]: NO log root directory is: 
/var/log/opensaf/saflog
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer connected: 1 
(safLogService) <7, 2020f>
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO implementer for class 
'SaLogStreamConfig' is safLogService => class extent is safe.
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO implementer for class 
'OpenSafLogConfig' is safLogService => class extent is safe.
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafntfd[2401]: Started
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer (applier) 
connected: 2 (@OpenSafImmReplicatorA) <16, 2020f>
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafntfimcnd[2408]: NO Started
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafclmd[2415]: Started
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Announce sync, epoch:59
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO SERVER STATE: 
IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO NODE STATE-> 
IMM_NODE_R_AVAILABLE
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmd[2343]: NO Successfully announced 
sync. New ruling epoch:59
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer connected: 3 
(safClmService) <18, 2020f>
Jan 16 11:54:20 SLES-64BIT-SLOT2 osafimmloadd: NO Sync starting
Jan 16 11:54:21 SLES-64BIT-SLOT2 osafimmnd[2353]: WA Cannot allow official 
dump/backup when imm-sync is in progress
Jan 16 11:54:22 SLES-64BIT-SLOT2 osafimmnd[2353]: WA Cannot allow official 
dump/backup when imm-sync is in progress
Jan 16 11:54:23 SLES-64BIT-SLOT2 osafimmnd[2353]: WA Cannot allow official 
dump/backup when imm-sync is in progress
Jan 16 11:54:23 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: ACTIVE, Node Down for 
node id: 2010f
Jan 16 11:54:23 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId = 0 EE 
Name = No EE Mapped, Reason: Failover occurred, but this node is not yet ready, 
OwnNodeId = 131599, SupervisionTime = 60
Jan 16 11:54:23 SLES-64BIT-SLOT2 kernel: [   95.696195] TIPC: Resetting link 
<1.1.2:eth0-1.1.1:eth0>, peer not responding
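
When correlating the two controller logs, note that the same node ids appear in two notations: the IMM and FM "Node Down" lines print them in hex (2010f, 2020f), while the FM reboot lines print OwnNodeId in decimal (131343, 131599). A minimal sketch of the conversion, using a hypothetical helper name that is not part of OpenSAF:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not an OpenSAF API): parse a node id as printed in
 * hex without a 0x prefix, e.g. "2010f", into its numeric value.
 */
static uint32_t node_id_from_hex(const char *hex)
{
    return (uint32_t)strtoul(hex, NULL, 16);
}
```

With this, `node_id_from_hex("2010f")` yields 131343 (the OwnNodeId logged on SC-1) and `node_id_from_hex("2020f")` yields 131599 (the OwnNodeId logged on SC-2).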


Syslog of PL-3:

Jan 16 11:54:33 SLES-64BIT-SLOT3 kernel: [  432.560674] TIPC: Established link 
<1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: WA DISCARD DUPLICATE FEVS 
message:24684
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: WA DISCARD DUPLICATE FEVS 
message:24763
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: WA DISCARD DUPLICATE FEVS 
message:24764
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: NO Global discard node 
received for nodeId:2020f pid:2353
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: NO Implementer disconnected 1 
<0, 2020f(down)> (safLogService)
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: NO Implementer disconnected 3 
<0, 2020f(down)> (safClmService)
Jan 16 11:54:33 SLES-64BIT-SLOT3 osafimmnd[4300]: NO Implementer disconnected 2 
<0, 2020f(down)> (@OpenSafImmReplicatorA)
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: WA DISCARD DUPLICATE FEVS 
message:24765
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: WA DISCARD DUPLICATE FEVS 
message:24778
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: WA DISCARD DUPLICATE FEVS 
message:24779
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 16 11:54:34 SLES-64BIT-SLOT3 osafimmnd[4300]: NO Global discard node 
received for nodeId:2020f pid:0
Jan 16 11:54:35 SLES-64BIT-SLOT3 osafimmnd[4300]: ER IMMND forced to restart on 
order from IMMD, exiting
Jan 16 11:54:39 SLES-64BIT-SLOT3 osafclmna[4325]: ER Exiting
Jan 16 11:54:39 SLES-64BIT-SLOT3 opensafd[4230]: ER Failed   DESC:CLMNA
Jan 16 11:54:39 SLES-64BIT-SLOT3 opensafd[4230]: ER Going for recovery
Jan 16 11:54:39 SLES-64BIT-SLOT3 opensafd[4230]: ER Trying To RESPAWN 
/usr/lib64/opensaf/clc-cli/osaf-clmna attempt #1
Jan 16 11:54:39 SLES-64BIT-SLOT3 opensafd[4230]: ER Sending SIGKILL to CLMNA, 
pid=4320
Jan 16 11:54:40 SLES-64BIT-SLOT3 kernel: [  439.296429] TIPC: Resetting link 
<1.1.3:eth0-1.1.2:eth0>, peer not responding
Jan 16 11:54:40 SLES-64BIT-SLOT3 kernel: [  439.296460] TIPC: Lost link 
<1.1.3:eth0-1.1.2:eth0> on network plane A
Jan 16 11:54:40 SLES-64BIT-SLOT3 kernel: [  439.296469] TIPC: Lost contact with 
<1.1.2>
Jan 16 11:54:40 SLES-64BIT-SLOT3 kernel: [  439.548194] TIPC: Resetting link 
<1.1.3:eth0-1.1.1:eth0>, peer not responding
Jan 16 11:54:40 SLES-64BIT-SLOT3 kernel: [  439.548208] TIPC: Lost link 
<1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 16 11:54:40 SLES-64BIT-SLOT3 kernel: [  439.548221] TIPC: Lost contact with 
<1.1.1>
Jan 16 11:54:54 SLES-64BIT-SLOT3 osafclmna[4348]: Started
Jan 16 11:55:23 SLES-64BIT-SLOT3 kernel: [  482.130080] TIPC: Established link 
<1.1.3:eth0-1.1.2:eth0> on network plane A
Jan 16 11:55:26 SLES-64BIT-SLOT3 kernel: [  485.847290] TIPC: Established link 
<1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 16 11:55:34 SLES-64BIT-SLOT3 opensafd[4230]: ER Timed-out for response from 
CLMNA
Jan 16 11:55:34 SLES-64BIT-SLOT3 opensafd[4230]: ER Could Not RESPAWN CLMNA
Jan 16 11:55:34 SLES-64BIT-SLOT3 opensafd[4230]: ER
Jan 16 11:55:34 SLES-64BIT-SLOT3 opensafd[4230]: ER Trying To RESPAWN 
/usr/lib64/opensaf/clc-cli/osaf-clmna attempt #2
Jan 16 11:55:34 SLES-64BIT-SLOT3 opensafd[4230]: ER Sending SIGKILL to CLMNA, 
pid=4343
Jan 16 11:55:34 SLES-64BIT-SLOT3 osafclmna[4348]: exiting on signal 15
Jan 16 11:55:49 SLES-64BIT-SLOT3 osafclmna[4375]: Started
Jan 16 11:56:27 SLES-64BIT-SLOT3 osafclmna[4375]: NO 
safNode=PL-3,safCluster=myClmCluster Joined cluster, nodeid=2030f
Jan 16 11:56:27 SLES-64BIT-SLOT3 osafamfnd[4396]: Started
Jan 16 12:06:34 SLES-64BIT-SLOT3 kernel: [ 1153.316104] TIPC: Resetting link 
<1.1.3:eth0-1.1.4:eth0>, peer not responding
Jan 16 12:06:34 SLES-64BIT-SLOT3 kernel: [ 1153.316111] TIPC: Lost link 
<1.1.3:eth0-1.1.4:eth0> on network plane A
Jan 16 12:06:34 SLES-64BIT-SLOT3 kernel: [ 1153.316116] TIPC: Lost contact with 
<1.1.4>
Jan 16 12:09:07 SLES-64BIT-SLOT3 osafamfnd[4396]: saImmOmInitialize FAILED, rc 
= 6
Jan 16 12:12:57 SLES-64BIT-SLOT3 opensafd[4230]: ER Timed-out for response from 
AMFND
Jan 16 12:12:57 SLES-64BIT-SLOT3 opensafd[4230]: ER
Jan 16 12:12:57 SLES-64BIT-SLOT3 opensafd[4230]: ER Going for recovery
Jan 16 12:12:57 SLES-64BIT-SLOT3 osafclmna[4375]: exiting on signal 15

At 11:54 a crash is observed on PL-3. The backtrace of the clmna crash follows:

Core was generated by `/usr/lib64/opensaf/osafclmna --tracemask=0xffffffff'.
Program terminated with signal 6, Aborted.
  #0  0x00007f6a3a0ccb55 in raise () from /lib64/libc.so.6
(gdb) bt
  #0  0x00007f6a3a0ccb55 in raise () from /lib64/libc.so.6
  #1  0x00007f6a3a0ce131 in abort () from /lib64/libc.so.6
  #2  0x00007f6a3a109c2f in __libc_message () from /lib64/libc.so.6
  #3  0x00007f6a3a10f358 in malloc_printerr () from /lib64/libc.so.6
  #4  0x00007f6a3a1142fc in free () from /lib64/libc.so.6
  #5  0x000000000040251b in clmna_process_mbx (mbx=<optimized out>) at 
main.c:515
  #6  0x0000000000402c12 in main (argc=<optimized out>, argv=<optimized out>) 
at main.c:634
(gdb) thread apply all bt

Thread 3 (Thread 0x7f6a3b0afb00 (LWP 4328)):
  #0  0x00007f6a3a1684f6 in poll () from /lib64/libc.so.6
  #1  0x00007f6a3aca49ae in mdtm_process_recv_events () at mds_dt_tipc.c:580
  #2  0x00007f6a3a4157b6 in start_thread () from /lib64/libpthread.so.0
  #3  0x00007f6a3a1719cd in clone () from /lib64/libc.so.6
  #4  0x0000000000000000 in ?? ()

Thread 2 (Thread 0x7f6a3b0deb00 (LWP 4327)):
  #0  0x00007f6a3a1684f6 in poll () from /lib64/libc.so.6
  #1  0x00007f6a3ac695ba in osaf_poll_no_timeout (io_fds=0x7f6a3b0de290, 
i_nfds=1) at osaf_poll.c:31
  #2  0x00007f6a3ac697b5 in osaf_ppoll (io_fds=0x7f6a3b0de290, i_nfds=1, 
i_timeout_ts=0xffffffffffffffff, i_sigmask=0xffffffffffffffff) at osaf_poll.c:78
  #3  0x00007f6a3ac6fe2f in ncs_tmr_wait () at sysf_tmr.c:411
  #4  0x00007f6a3a4157b6 in start_thread () from /lib64/libpthread.so.0
  #5  0x00007f6a3a1719cd in clone () from /lib64/libc.so.6
  #6  0x0000000000000000 in ?? ()

Thread 1 (Thread 0x7f6a3b0b2700 (LWP 4325)):
  #0  0x00007f6a3a0ccb55 in raise () from /lib64/libc.so.6
  #1  0x00007f6a3a0ce131 in abort () from /lib64/libc.so.6
  #2  0x00007f6a3a109c2f in __libc_message () from /lib64/libc.so.6
  #3  0x00007f6a3a10f358 in malloc_printerr () from /lib64/libc.so.6
  #4  0x00007f6a3a1142fc in free () from /lib64/libc.so.6
  #5  0x000000000040251b in clmna_process_mbx (mbx=<optimized out>) at 
main.c:515
  #6  0x0000000000402c12 in main (argc=<optimized out>, argv=<optimized out>) 
at main.c:634
(gdb) q

The CLMNA traces are attached. This issue may not be reproducible.
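
Frames #2 to #4 of the crashing thread (`__libc_message`, `malloc_printerr`, `free`) show glibc aborting from inside the `free()` call made at main.c:515. glibc raises this abort when it detects an invalid pointer, a double free, or a corrupted heap; which of these applies here is not established in this ticket. As an illustration only, the sketch below shows a release pattern that makes an accidental second release harmless. The `demo_evt` type and both functions are hypothetical; the real CLMNA event structure and mailbox code are not shown in this ticket.

```c
#include <stdlib.h>

/* Hypothetical event type; NOT the real CLMNA structure. */
struct demo_evt {
    char *payload;
};

/* Allocate an event with a payload buffer of at least one byte. */
static struct demo_evt *demo_evt_new(size_t payload_len)
{
    struct demo_evt *evt = calloc(1, sizeof(*evt));
    if (evt == NULL)
        return NULL;

    evt->payload = calloc(1, payload_len ? payload_len : 1);
    if (evt->payload == NULL) {
        free(evt);
        return NULL;
    }
    return evt;
}

/*
 * Release an event and clear the caller's pointer, so that a second
 * release through the same pointer becomes a no-op instead of a double free.
 */
static void demo_evt_release(struct demo_evt **evt)
{
    if (evt == NULL || *evt == NULL)
        return;

    free((*evt)->payload);
    free(*evt);
    *evt = NULL;
}
```

This only protects against a repeated release through the same pointer variable. It does not help if two different pointers alias the same event, or if the pointer was never returned by the allocator.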

