- Description has changed:

Diff:

~~~~

--- old
+++ new
@@ -39,3 +39,6 @@
 Sep 13 14:25:59 SC-2 user.notice opensaf_reboot: Rebooting local node; 
timeout=60
 
 ~~~
+
+Update: it seems I forgot to do "./opensaf nodestop" between the two 
"./opensaf nodestart" above. Thus, there are probably two SC-2 nodes at the 
same time, and the error message "Node already exit in the cluster with smiler 
configuration" should be interpreted as "duplicate node detected in the 
network". Reducing the priority of this defect to "minor". Still two problems 
ought to be fixed: the error message should be changed so that it is clear what 
it means, and osafdtmd should not assert (it could call opensaf_reboot() if 
there is a configuration problem, but asserting indicates a software problem).
+

~~~~

- **Priority**: major --> minor



---

**[tickets:#2030] dtm: "Node already exit in the cluster with smiler 
configuration"**

**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Tue Sep 13, 2016 12:10 PM UTC by Anders Widell
**Last Updated:** Tue Sep 13, 2016 12:30 PM UTC
**Owner:** nobody


osafdtmd does not handle rapid consecutive node reboots properly. I got the 
following errors in syslog:

~~~
Sep 13 14:00:52 SC-2 local0.err osafdtmd[378]: ER DTM:  Node already exit in 
the cluster with smiler configuration , correct the other joining Node 
configuration 
Sep 13 14:01:02 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed 
.node_ip: 192.168.0.1, node_id: 0
Sep 13 14:01:06 SC-2 local0.err osafdtmd[378]: ER DTM: dtm_node_add failed 
.node_ip: 192.168.0.1, node_id: 0
~~~

Here are the steps to reproduce this problem in UML:

./opensaf start
(wait until the cluster comes up)
./opensaf nodestop 2
(wait a few seconds)
./opensaf nodestart 2
./opensaf nodestart 2

The last two commands should be executed in quick succession, with a delay of 
about one second between them.

It seems that osafdtmd asserts and dies when this happens. Here is the result 
from a second run of the above test:

~~~
Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: ER DTM:  Node already exit in 
the cluster with smiler configuration , correct the other joining Node 
configuration 
Sep 13 14:25:58 SC-2 local0.err osafdtmd[378]: dtm_node.c:109: 
dtm_process_node_info: Assertion '0' failed.
Sep 13 14:25:58 SC-2 local0.err osafamfd[478]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafclmna[468]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafclmd[458]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafntfd[448]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osaflogd[437]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafimmnd[426]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafimmd[415]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osaffmd[405]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.err osafrded[392]: MDTM:SOCKET recd_bytes :0, conn 
lost with dh server, exiting library err :Success
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 
'SC-1'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 
'PL-4'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 
'PL-5'
Sep 13 14:25:58 SC-2 local0.notice osafdtmd[378]: NO Established contact with 
'PL-3'
Sep 13 14:25:59 SC-2 user.notice osafdtmd: osafdtmd Process down, Rebooting the 
node
Sep 13 14:25:59 SC-2 user.notice opensaf_reboot: Rebooting local node; 
timeout=60

~~~

Update: it seems I forgot to run "./opensaf nodestop" between the two 
"./opensaf nodestart" commands above. Thus, there were probably two SC-2 nodes 
running at the same time, and the error message "Node already exit in the 
cluster with smiler configuration" should be interpreted as "duplicate node 
detected in the network". Reducing the priority of this defect to "minor". Two 
problems still ought to be fixed: the error message should be changed so that 
its meaning is clear, and osafdtmd should not assert (it could call 
opensaf_reboot() if there is a configuration problem, but an assertion 
indicates a software problem).
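
To illustrate the distinction between a configuration problem and a software problem, here is a minimal standalone sketch. It is not taken from dtm_node.c: the types `struct peer` and `struct peer_table` and the functions `peer_table_join()` and `join_requires_reboot()` are hypothetical names invented for this example, and the log text is only a suggestion. A duplicate node id is reported with an explicit message and returned as an error code, while `assert()` is kept for conditions that can only arise from a programming error.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PEERS 16
#define IP_STR_LEN 46 /* enough for a textual IPv6 address */

/* Hypothetical, simplified peer record; not the real DTM node structure. */
struct peer {
	uint32_t node_id;
	char node_ip[IP_STR_LEN];
};

struct peer_table {
	struct peer peers[MAX_PEERS];
	size_t count;
};

enum join_result {
	JOIN_OK,
	JOIN_DUPLICATE_NODE, /* configuration problem in the cluster */
	JOIN_TABLE_FULL
};

/* Register a joining peer, rejecting a node id that is already present. */
enum join_result peer_table_join(struct peer_table *t, uint32_t node_id,
				 const char *node_ip)
{
	/* A NULL argument would be a bug in the caller, so assert fits here. */
	assert(t != NULL);
	assert(node_ip != NULL);

	for (size_t i = 0; i < t->count; i++) {
		if (t->peers[i].node_id == node_id) {
			/* Configuration problem: report it clearly, do not assert. */
			fprintf(stderr,
				"duplicate node detected in the network: "
				"node_id 0x%x is already in use by %s "
				"(join attempt from %s)\n",
				(unsigned)node_id, t->peers[i].node_ip, node_ip);
			return JOIN_DUPLICATE_NODE;
		}
	}

	if (t->count == MAX_PEERS)
		return JOIN_TABLE_FULL;

	struct peer *p = &t->peers[t->count++];
	p->node_id = node_id;
	snprintf(p->node_ip, sizeof(p->node_ip), "%s", node_ip);
	return JOIN_OK;
}

/* Returns 1 if the caller should request an orderly node reboot. */
int join_requires_reboot(enum join_result r)
{
	return r == JOIN_DUPLICATE_NODE;
}
```

In the real daemon, the caller would presumably react to `JOIN_DUPLICATE_NODE` by calling opensaf_reboot() with a descriptive reason, as suggested above. That call is left out here so that the sketch compiles without the OpenSAF headers.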




---

Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is 
subscribed to https://sourceforge.net/p/opensaf/tickets/

To unsubscribe from further messages, a project admin can change settings at 
https://sourceforge.net/p/opensaf/admin/tickets/options.  Or, if this is a 
mailing list, you can unsubscribe from the mailing list.