Hi,

Thanks.
So I have a robustness pb with Pacemaker/corosync ... you'll tell me
if it seems normal or not , if I miss something or not :
for example on two nodes, I make a script which configure 60 resnames 
with ocf pacemaker Dummy script,
(30 with INFINITY location on node1 and 30 with INFINITY location on node2)
and my test consists in a loop of [ 60 "crm resource start <resname>", 
sleep 60, 60 "crm resource stop <resname>",
sleep 60 ] and so on ...

After 2 or 3 successful loops, pacemaker systematically fails on one 
node (meaning crm_mon does does connect anymore) and this
node is fenced by the other one.  I have tried several times and it is 
systematic.

On the alive node, my script test displays :
STOP LOOP : Number 3
Stop resname1
*Call cib_replace failed (-41): Remote node did not respond*
<null>
ERROR on Stop resname1

and when the fenced node is rebooted and I look in syslog, I only have 
these lines before
the first boot Linux line :
1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [4557]: debug: 
rsc:resname22:335: monitor
1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [12818]: debug: 
perform_ra_op: resetting scheduler class to SCHED_OTHER
1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [4557]: debug: 
rsc:resname20:334: monitor
1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [4557]: debug: 
rsc:resname18:333: monitor
1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [12819]: debug: 
perform_ra_op: resetting scheduler class to SCHED_OTHER
1291896079 2010 Dec  9 13:01:19 node2 daemon debug lrmd [12820]: debug: 
perform_ra_op: resetting scheduler class to SCHED_OTHER
1291896079 2010 Dec  9 13:01:19 node2 daemon debug Dummy DEBUG: 
resname22 monitor : 0
1291896079 2010 Dec  9 13:01:19 node2 daemon debug Dummy DEBUG: 
resname18 monitor : 0
1291896079 2010 Dec  9 13:01:19 node2 daemon debug Dummy DEBUG: 
resname20 monitor : 0
1291896080 2010 Dec  9 13:01:20 node2 daemon debug lrmd [4557]: debug: 
rsc:resname60:373: monitor
1291896080 2010 Dec  9 13:01:20 node2 daemon debug lrmd [12839]: debug: 
perform_ra_op: resetting scheduler class to SCHED_OTHER
1291896080 2010 Dec  9 13:01:20 node2 daemon debug Dummy DEBUG: 
resname60 monitor : 0
1291896082 2010 Dec  9 13:01:22 node2 daemon err corosync   [TOTEM ] 
FAILED TO RECEIVE
1291896082 2010 Dec  9 13:01:22 node2 daemon debug corosync   [TOTEM ] 
entering GATHER state from 6.
1291896328 2010 Dec  9 13:05:28 node2 syslog notice syslog-ng syslog-ng 
starting up; version='3.0.3'
1291896328 2010 Dec  9 13:05:28 node2 kern info kernel Initializing 
cgroup subsys cpuset
1291896328 2010 Dec  9 13:05:28 node2 kern info kernel Initializing 
cgroup subsys cpu
1291896328 2010 Dec  9 13:05:28 node2 kern notice kernel Linux version 
2.6.32-30.el6. .... etc.

Releases :
corosync-1.2.3-21.el6.x86_64
pacemaker-1.1.2-2.el6.x86_64

Alain

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to