- **status**: fixed --> duplicate
- **Milestone**: future --> never


---

**[tickets:#237] mds : tcp Opensaf fails to come up on payloads after double fault**

**Status:** duplicate
**Milestone:** never
**Created:** Thu May 16, 2013 06:11 AM UTC by A V Mahesh (AVM)
**Last Updated:** Mon Jan 12, 2015 11:36 AM UTC
**Owner:** A V Mahesh (AVM)

Migrated from http://devel.opensaf.org/ticket/3125


Changeset : 4200
Transport : TCP/ipv6 ( link local )
patches : 2794
PBE enabled.
Model : 2n
configuration : 1 SG, 2 SUs, 4 comps in each SU, 4 SIs with 1 CSI each.
SU1 is hosted on PL-3 and SU2 on PL-4.
SI1 is the sponsor; SI2, SI3 and SI4 are dependent on SI1.


scenario:


Perform an si-swap on any SI, and make sure the sponsor SI rejects both the 
quiesced assignment at SU1 and the new active assignment on SU2, resulting in a 
double fault. The recovery is NODE_FAILOVER. Both payloads reboot and fail to 
come up even after 10 minutes.


/var/log/messages on pl-3:
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: NO Assigning 
'safSi=SI1,safApp=test2nApp' QUIESCED to 'safSu=SU1,safSg=SG,safApp=test2nApp'
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: NO 
'safComp=COMP1,safSu=SU1,safSg=SG,safApp=test2nApp' faulted due to 
'csiSetcallbackFailed' : Recovery is 'nodeFailover'
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: NO Terminating all application 
components (abruptly & unordered)
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: IN 
'safComp=COMP1,safSu=SU1,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: NO 
'safSu=SU1,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED => TERMINATING
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: IN 
'safComp=COMP2,safSu=SU1,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: IN 
'safComp=COMP3,safSu=SU1,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
Apr 22 11:27:06 OEL-64BIT-SLOT2 osafamfnd[1431]: IN 
'safComp=COMP4,safSu=SU1,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING


...
...


Apr 22 11:27:51 OEL-64BIT-SLOT2 osafimmnd[1390]: NO Persistent Back-End 
capability configured, Pbe file:imm.db
Apr 22 11:27:51 OEL-64BIT-SLOT2 osafdtmd[1374]: NO Established contact with 
'PL-5'
Apr 22 11:27:51 OEL-64BIT-SLOT2 osafdtmd[1374]: NO Established contact with 
'SC-2'
Apr 22 11:27:53 OEL-64BIT-SLOT2 osafdtmd[1374]: NO Established contact with 
'PL-4'
Apr 22 11:35:51 OEL-64BIT-SLOT2 opensafd[1367]: ER Timed-out for response from 
IMMND
Apr 22 11:35:51 OEL-64BIT-SLOT2 opensafd[1367]: ER
Apr 22 11:35:51 OEL-64BIT-SLOT2 opensafd[1367]: ER Going for recovery
Apr 22 11:35:51 OEL-64BIT-SLOT2 opensafd[1367]: ER Trying To RESPAWN 
/usr/lib64/opensaf/clc-cli/osaf-immnd attempt #1
Apr 22 11:35:51 OEL-64BIT-SLOT2 opensafd[1367]: ER Sending SIGKILL to IMMND, 
pid=1381
Apr 22 11:36:06 OEL-64BIT-SLOT2 osafimmnd[2498]: Started
Apr 22 11:36:06 OEL-64BIT-SLOT2 osafimmnd[2498]: NO Persistent Back-End 
capability configured, Pbe file:imm.db
Apr 22 11:38:24 OEL-64BIT-SLOT2 kernel: hrtimer: interrupt took 4799281 ns
Apr 22 11:44:06 OEL-64BIT-SLOT2 opensafd[1367]: ER Timed-out for response from 
IMMND
Apr 22 11:44:06 OEL-64BIT-SLOT2 opensafd[1367]: ER Could Not RESPAWN IMMND
Apr 22 11:44:06 OEL-64BIT-SLOT2 opensafd[1367]: ER
Apr 22 11:44:06 OEL-64BIT-SLOT2 opensafd[1367]: ER Trying To RESPAWN 
/usr/lib64/opensaf/clc-cli/osaf-immnd attempt #2
Apr 22 11:44:06 OEL-64BIT-SLOT2 opensafd[1367]: ER Sending SIGKILL to IMMND, 
pid=2493
Apr 22 11:44:21 OEL-64BIT-SLOT2 osafimmnd[3613]: Started
Apr 22 11:44:21 OEL-64BIT-SLOT2 osafimmnd[3613]: NO Persistent Back-End 
capability configured, Pbe file:imm.db
Apr 22 11:52:21 OEL-64BIT-SLOT2 opensafd[1367]: ER Timed-out for response from 
IMMND
Apr 22 11:52:21 OEL-64BIT-SLOT2 opensafd[1367]: ER Could Not RESPAWN IMMND
Apr 22 11:52:21 OEL-64BIT-SLOT2 opensafd[1367]: ER
Apr 22 11:52:21 OEL-64BIT-SLOT2 opensafd[1367]: ER FAILED TO RESPAWN
Apr 22 11:52:21 OEL-64BIT-SLOT2 osafimmnd[3613]: ER MDTM:socket_recv() = 0, 
conn lost with dh server, exiting library err :Success
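
The spacing of the respawn attempts can be read off the timestamps above. The following is a rough sketch only: the helper name `seconds_between` is hypothetical, and the year 2013 is an assumption, since syslog timestamps carry no year. It measures the gap between each `osafimmnd ... Started` line and the next `Timed-out for response from IMMND` line in the pl-3 log.

```python
from datetime import datetime

ASSUMED_YEAR = 2013  # assumption: syslog lines do not record the year


def seconds_between(start, end):
    """Return whole seconds between two syslog timestamps such as 'Apr 22 11:36:06'."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{ASSUMED_YEAR} {start}", fmt)
    t1 = datetime.strptime(f"{ASSUMED_YEAR} {end}", fmt)
    return int((t1 - t0).total_seconds())


# Respawn attempt #1: osafimmnd[2498] started, then opensafd timed out.
print(seconds_between("Apr 22 11:36:06", "Apr 22 11:44:06"))
# Respawn attempt #2: osafimmnd[3613] started, then opensafd timed out.
print(seconds_between("Apr 22 11:44:21", "Apr 22 11:52:21"))
```

Both gaps come out to 480 seconds, so each respawned IMMND had 8 minutes to respond before opensafd gave up on it.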


/var/log/messages on pl-4:
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: NO 
'safComp=COMP3,safSu=SU2,safSg=SG,safApp=test2nApp' faulted due to 
'csiSetcallbackFailed' : Recovery is 'nodeFailover'
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: NO Terminating all application 
components (abruptly & unordered)
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: IN 
'safComp=COMP1,safSu=SU2,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: NO 
'safSu=SU2,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED => TERMINATING
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: IN 
'safComp=COMP2,safSu=SU2,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: IN 
'safComp=COMP3,safSu=SU2,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
Apr 22 11:27:08 OEL-64BIT-SLOT2 osafamfnd[1332]: IN 
'safComp=COMP4,safSu=SU2,safSg=SG,safApp=test2nApp' Presence State INSTANTIATED 
=> TERMINATING
...
...
...
Apr 22 11:44:07 OEL-64BIT-SLOT2 opensafd[1324]: ER Sending SIGKILL to IMMND, 
pid=2451
Apr 22 11:44:22 OEL-64BIT-SLOT2 osafimmnd[3570]: Started
Apr 22 11:44:22 OEL-64BIT-SLOT2 osafimmnd[3570]: NO Persistent Back-End 
capability configured, Pbe file:imm.db
Apr 22 11:52:21 OEL-64BIT-SLOT2 osafdtmd[1331]: NO Lost contact with 'PL-3'
Apr 22 11:52:22 OEL-64BIT-SLOT2 opensafd[1324]: ER Timed-out for response from 
IMMND
Apr 22 11:52:22 OEL-64BIT-SLOT2 opensafd[1324]: ER Could Not RESPAWN IMMND
Apr 22 11:52:22 OEL-64BIT-SLOT2 opensafd[1324]: ER
Apr 22 11:52:22 OEL-64BIT-SLOT2 opensafd[1324]: ER FAILED TO RESPAWN
Apr 22 11:52:23 OEL-64BIT-SLOT2 osafimmnd[3570]: ER MDTM:socket_recv() = 0, 
conn lost with dh server, exiting library err :Success
Apr 22 11:52:23 OEL-64BIT-SLOT2 opensafd: Starting OpenSAF failed


Changed 3 weeks ago by nagendra
  This looks like a DTM problem. At ctr1, dtmd is dropping the packets and immnd 
is not syncing:


Apr 22 11:27:51.010741 osafdtmd [1297:dtm_node_sockets.c:1342] >> 
dtm_dgram_recvfrom_bmcast
Apr 22 11:27:51.024279 osafdtmd [1297:dtm_node_sockets.c:1374] << 
dtm_dgram_recvfrom_bmcast: rc :58
Apr 22 11:27:51.024346 osafdtmd [1297:dtm_node_sockets.c:1098] >> 
dtm_process_connect
Apr 22 11:27:51.024367 osafdtmd [1297:dtm_node_sockets.c:1123] TR mcast flag: 0
Apr 22 11:27:51.024376 osafdtmd [1297:dtm_node_db.c:0134] >> dtm_node_get_by_id
Apr 22 11:27:51.024385 osafdtmd [1297:dtm_node_db.c:0145] << dtm_node_get_by_id
Apr 22 11:27:51.024408 osafdtmd [1297:dtm_node_sockets.c:1150] TR DTM:new_node 
node already discovered droping message
Apr 22 11:27:51.024420 osafdtmd [1297:dtm_node_sockets.c:1151] << 
dtm_process_connect: sock_desc :-1
Apr 22 11:27:51.261437 osafdtmd [1297:dtm_node_sockets.c:1342] >> 
dtm_dgram_recvfrom_bmcast
Apr 22 11:27:51.261556 osafdtmd [1297:dtm_node_sockets.c:1374] << 
dtm_dgram_recvfrom_bmcast: rc :58
Apr 22 11:27:51.261581 osafdtmd [1297:dtm_node_sockets.c:1098] >> 
dtm_process_connect
Apr 22 11:27:51.261600 osafdtmd [1297:dtm_node_sockets.c:1123] TR mcast flag: 0
Apr 22 11:27:51.261617 osafdtmd [1297:dtm_node_db.c:0134] >> dtm_node_get_by_id
Apr 22 11:27:51.261636 osafdtmd [1297:dtm_node_db.c:0145] << dtm_node_get_by_id
Apr 22 11:27:51.261677 osafdtmd [1297:dtm_node_sockets.c:1150] TR DTM:new_node 
node already discovered droping message
Apr 22 11:27:51.261695 osafdtmd [1297:dtm_node_sockets.c:1151] << 
dtm_process_connect: sock_desc :-1
Apr 22 11:27:51.514444 osafdtmd [1297:dtm_node_sockets.c:1342] >> 
dtm_dgram_recvfrom_bmcast
Apr 22 11:27:51.514573 osafdtmd [1297:dtm_node_sockets.c:1374] << 
dtm_dgram_recvfrom_bmcast: rc :58
Apr 22 11:27:51.514608 osafdtmd [1297:dtm_node_sockets.c:1098] >> 
dtm_process_connect
Apr 22 11:27:51.514634 osafdtmd [1297:dtm_node_sockets.c:1123] TR mcast flag: 0
Apr 22 11:27:51.514661 osafdtmd [1297:dtm_node_db.c:0134] >> dtm_node_get_by_id
Apr 22 11:27:51.514688 osafdtmd [1297:dtm_node_db.c:0145] << dtm_node_get_by_id
Apr 22 11:27:51.514714 osafdtmd [1297:dtm_node_sockets.c:1150] TR DTM:new_node 
node already discovered droping message
Apr 22 11:27:51.514769 osafdtmd [1297:dtm_node_sockets.c:1151] << 
dtm_process_connect: sock_desc :-1
Apr 22 11:27:51.765432 osafdtmd [1297:dtm_node_sockets.c:1342] >> 
dtm_dgram_recvfrom_bmcast
Apr 22 11:27:51.765542 osafdtmd [1297:dtm_node_sockets.c:1374] << 
dtm_dgram_recvfrom_bmcast: rc :58
Apr 22 11:27:51.765562 osafdtmd [1297:dtm_node_sockets.c:1098] >> 
dtm_process_connect


Changed 3 weeks ago by nagendra
  - component changed from opensaf to infrastructure/dtms
Changed 3 weeks ago by nagendra
  - owner changed from nagendra to mahesh
  - status changed from accepted to assigned
Changed 9 days ago by mahesh
  - status changed from assigned to accepted



PL-3 and PL-4 are not able to establish contact with 'SC-1' within 5 seconds 
(dtmd.conf -> DTM_INI_DIS_TIMEOUT_SECS=5). Please tune the 
DTM_INI_DIS_TIMEOUT_SECS value according to your cluster size, the performance 
of the system, and the OpenSAF application load.
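
To see which peers a payload actually reached after dtmd started, one option is to scan syslog for the `Established contact with` messages quoted in this ticket. The following is a minimal sketch and not part of OpenSAF: the function name `missing_contacts` and the expected node list are assumptions for illustration, and the sample lines are the wrapped pl-3 entries joined back onto one line.

```python
import re

# Matches the osafdtmd syslog message quoted in this ticket, e.g.
#   osafdtmd[1374]: NO Established contact with 'PL-5'
CONTACT_RE = re.compile(r"osafdtmd\[\d+\]: NO Established contact with '([^']+)'")


def missing_contacts(log_lines, expected_nodes):
    """Return the expected nodes for which no contact line was logged."""
    seen = set()
    for line in log_lines:
        match = CONTACT_RE.search(line)
        if match:
            seen.add(match.group(1))
    return [node for node in expected_nodes if node not in seen]


# Sample taken from the pl-3 excerpt in this ticket (wrapped lines joined).
sample = [
    "Apr 22 11:27:51 OEL-64BIT-SLOT2 osafdtmd[1374]: NO Established contact with 'PL-5'",
    "Apr 22 11:27:51 OEL-64BIT-SLOT2 osafdtmd[1374]: NO Established contact with 'SC-2'",
    "Apr 22 11:27:53 OEL-64BIT-SLOT2 osafdtmd[1374]: NO Established contact with 'PL-4'",
]

# The expected peer list is an assumption about this cluster's node names.
print(missing_contacts(sample, ["SC-1", "SC-2", "PL-4", "PL-5"]))
```

For the quoted excerpt this reports only 'SC-1' as missing, which matches the observation above. The excerpt is elided in places, so a run against the full /var/log/messages would be needed to confirm it.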







