Re: [Linux-HA] Heartbeat 3.0.3 stable version + RHEL 6.1: restart network will make heartbeat not send broadcasts

2011-07-18 Thread Ai Lei
Hi:

I'm using the Heartbeat 3.0.3 stable version on the RHEL 6.1 x64 platform and
found the following issue:
If I restart the network service, Heartbeat stops sending broadcast packets
from port 694. That leaves the node with no chance to rejoin the HA cluster
unless it is rebooted.

Details of the cluster setup:

1. Compile Heartbeat 3.0.3 from source and install it on two RHEL 6.1 x64
nodes: installer001 and rhel61.
2. Compile Pacemaker 1.0.9 from source and install it on both nodes.
3. Configure /etc/ha.d/ha.cf and make sure both nodes show as Online in
crm status.
4. Run tcpdump -i eth0 port 694; we can see that both nodes are sending
heartbeat broadcast packets.

Details of configuration file:
=========================================
[root@rhel61 ~]# cat /etc/ha.d/ha.cf
autojoin none
bcast eth0
warntime 5
deadtime 15
initdead 60
keepalive 2
node installer001
node rhel61
crm respawn
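
(For reference, a commented version of the timing settings above, following
the standard ha.cf semantics:)

keepalive 2     # send a heartbeat every 2 seconds
warntime 5      # issue a "late heartbeat" warning after 5 seconds
deadtime 15     # declare the peer dead after 15 seconds of silence
initdead 60     # allow 60 seconds for network/peer to come up at startup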


Then I restarted the network service on the backup node installer001 (or
simply ran ifdown eth0; ifup eth0). Node rhel61 immediately detected
installer001 as offline, and node installer001 detected rhel61 as offline.
When I ran tcpdump -i eth0 port 694 on installer001 again, I could see
rhel61 still sending broadcast packets, but no broadcast packets coming
from installer001, even though the eth0 network was fully recovered by then.
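
(A diagnostic sketch, assuming the cause is a stale broadcast socket after
the interface bounce; these are standard RHEL tools, not a confirmed fix:)

# Check whether the heartbeat processes still hold a UDP socket on port 694:
ss -aunp | grep 694
# If the write socket is stale, restarting Heartbeat on the affected node
# may re-open the broadcast socket without a full reboot:
service heartbeat restart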

I've tried exactly the same case on RHEL 5.6 (Heartbeat 3.0.3), and it works
well: after restarting the network, the node can still send out broadcast
packets...

Thanks for your comments.
--Lei
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith with external/vcenter

2011-07-18 Thread Andrew Beekhof
On Mon, Jul 18, 2011 at 3:25 PM, lowshoe <lows...@gmail.com> wrote:

 hi guys,

 Help with this problem is still greatly appreciated! I can give more
 info, log messages, or configs if needed.

It might be best to contact SUSE support.  They'll be able to give
this a higher priority.


 regards, lowshoe.



 lowshoe wrote:

 hi guys,

 I already use the nice external/vcenter STONITH plugin in an Ubuntu
 10.04.2 LTS-based 2-node cluster, where it works like a charm.
 Now I want to use the same configuration on a SLES 11 SP1-based
 2-node cluster.

 The command-line test run directly with stonith succeeds:

 stonith -t external/vcenter VI_SERVER=*.*.*.* \
     VI_CREDSTORE=/path/to/vicredentials.xml \
     HOSTLIST="hostname1=vmdb1n1;hostname2=vmdb1n2" RESETPOWERON=0 -lS

 ** INFO: Cannot get parameter VI_PORTNUMBER from StonithNVpair
 ** INFO: Cannot get parameter VI_PROTOCOL from StonithNVpair
 ** INFO: Cannot get parameter VI_SERVICEPATH from StonithNVpair
 stonith: external/vcenter device OK.
 hostname1
 hostname2

 But when I try to get it working as a Pacemaker resource, I get errors
 when trying to start the resource. This is the config:

 crm configure primitive shoot-node1 stonith:external/vcenter \
   params VI_SERVER=*.*.*.* VI_CREDSTORE=/path/to/vicredentials.xml \
   HOSTLIST="node1=vm1" RESETPOWERON=0 \
   op monitor interval=60s

 crm configure primitive shoot-node2 stonith:external/vcenter \
   params VI_SERVER=*.*.*.* VI_CREDSTORE=/path/to/vicredentials.xml \
   HOSTLIST="node2=vm2" RESETPOWERON=0 \
   op monitor interval=60s


 location shoot-node1-placement shoot-node1 \
         rule $id="shoot-node1-placement-rule" -inf: #uname ne node1
 location shoot-node2-placement shoot-node2 \
         rule $id="shoot-node2-placement-rule" -inf: #uname ne node2

 And these are the errors I get:

 in crm_mon:
    shoot-node1     (stonith:external/vcenter):     Started node2
 Failed actions:
     shoot-node1_monitor_6 (node=node2, call=40, rc=1,
 status=complete): unknown error


 in /var/log/messages:

 Jul 14 15:47:49 node2 lrmd: [3655]: info: rsc:shoot-node1:27: start
 Jul 14 15:47:51 node2 lrmd: [3655]: info: stonithRA plugin: got metadata:
 [..]
 Jul 14 15:47:51 node2 lrmd: [3655]: WARN: G_SIG_dispatch: Dispatch
 function for SIGCHLD was delayed 1290 ms (> 100 ms) before being called
 (GSource: 0x6192c0)
 Jul 14 15:47:51 node2 lrmd: [3655]: info: G_SIG_dispatch: started at
 1718940021 should have started at 1718939892
 Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:28: monitor
 Jul 14 15:47:51 node2 stonith: external/vcenter device not accessible.
 Jul 14 15:47:51 node2 stonith-ng: [3653]: notice: log_operation: Operation
 'monitor' [20916] for device 'shoot-node1' returned: 1
 Jul 14 15:47:51 node2 lrmd: [3655]: info: cancel_op: operation monitor[28]
 on stonith::external/vcenter::shoot-node1 for client 3658, its parameters:
 HOSTLIST=[node1=vm1] VI_CREDSTORE=[/path/to/c
 redstore/vicredentials.xml] VI_SERVER=[*.*.*.*] RESETPOWERON=[0]
 crm_feature_set=[3.0.2] CRM_meta_name=[monitor] CRM_meta_timeout=[2]
 CRM_meta_interval=[6]  cancelled
 Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:29: stop
 Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:30: start
 Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:31: monitor
 Jul 14 15:47:51 node2 stonith: external/vcenter device not accessible.

 Why does this work on Ubuntu but not on SLES?

 On Ubuntu I use Corosync Cluster Engine version '1.2.0'; on SLES I use
 Corosync Cluster Engine version '1.2.7'. Could the version difference be
 the reason?
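
 (One way to check whether this is environmental rather than version-related:
 run the same status check that the cluster's monitor operation performs, by
 hand as root on node2, with the parameters from the resource definition.
 A sketch using the values above; -S shows the device status:)

 stonith -t external/vcenter VI_SERVER=*.*.*.* \
     VI_CREDSTORE=/path/to/vicredentials.xml \
     HOSTLIST="node1=vm1" RESETPOWERON=0 -S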


 regards, lowshoe




 --
 View this message in context: 
 http://old.nabble.com/stonith-with-external-vcenter-tp32061530p32080744.html
 Sent from the Linux-HA mailing list archive at Nabble.com.


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] split brain problem

2011-07-18 Thread Andrew Beekhof
On Sat, Jul 16, 2011 at 7:31 PM, Willi Fehler <willi.feh...@t-online.de> wrote:
 Hi,

 I've installed a Pacemaker/OpenAIS/Corosync/DRBD/MySQL cluster on
 CentOS 6 (VirtualBox).
 If I start both nodes at the same time, I always get a split-brain

Split brain as in, corosync on the two nodes can't talk to one another?

 situation. If I start one node and wait until it is promoted to DRBD
 master, everything works. How can I tell Pacemaker which node should
 always become master?

a location constraint with role=Master
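
(A minimal sketch of such a constraint, using the resource and node names
from the configuration below; the constraint id and score are made up:)

location drbd-master-on-linsrv001 ms_drbd_mysql \
    rule $id="drbd-master-rule" $role="Master" 100: \
    #uname eq linsrv001.willi-net.local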


 [root@linsrv001 ~]# crm configure show
 node linsrv001.willi-net.local
 node linsrv002.willi-net.local
 primitive drbd_mysql ocf:linbit:drbd \
     params drbd_resource=r0 \
     op monitor interval=15s
 primitive fs_mysql ocf:heartbeat:Filesystem \
     params device=/dev/drbd/by-res/r0 directory=/var/lib/mysql
 fstype=xfs
 primitive ip_mysql ocf:heartbeat:IPaddr2 \
     params ip=192.168.2.92 nic=eth0
 primitive mysqld lsb:mysql
 group mysql fs_mysql ip_mysql mysqld
 ms ms_drbd_mysql drbd_mysql \
     meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
 location cli-prefer-mysql mysql \
     rule $id=cli-prefer-rule-mysql inf: #uname eq
 linsrv001.willi-net.local
 colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
 order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
 property $id=cib-bootstrap-options \
     dc-version=1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe \
     cluster-infrastructure=openais \
     expected-quorum-votes=2 \
     no-quorum-policy=ignore \
     stonith-enabled=false

 My second question is: what happens if one node fails and I have to set
 up the whole node again? If I start OpenAIS/Corosync, what happens with
 the CIB? (Will the cluster configuration be transferred to the node?)

 Regards - Willi


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] The active trap of the SNMP is delayed.

2011-07-18 Thread renayama19661014
Hi All,

We are troubled in the face of this problem.
Please give advice.

* I have moved this thread to this mailing list because the problem appears
to be an issue in HA (Heartbeat) itself.

Best Regards,
Hideo Yamauchi.



--- On Fri, 2011/6/17, renayama19661...@ybb.ne.jp <renayama19661...@ybb.ne.jp>
wrote:

 Hi All,
 
 I registered this problem in Bugzilla.
 
  * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2604
 
 Best Regards,
 Hideo Yamauchi.
 
 --- On Wed, 2011/6/15, renayama19661...@ybb.ne.jp
 <renayama19661...@ybb.ne.jp> wrote:
 
  Hi All,
  
  I found a problem with an SNMP trap (sent from hbagent).

  The node 'active' trap can apparently be delayed.

  In addition, this problem occurs only sometimes, not always.
  
  
  I confirmed it with the following procedure:
  
  Step 1) Start the nodes.
  
  
  Last updated: Wed Jun 15 19:23:39 2011
  Stack: Heartbeat
  Current DC: srv02 (afe72fff-b7b4-4663-b845-872df29c635d) - partition 
  WITHOUT quorum
  Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
  2 Nodes configured, unknown expected votes
  1 Resources configured.
  
  
  Online: [ srv01 srv02 ]
  
   Resource Group: group-1
       prmDummy1  (ocf::heartbeat:Dummy): Started srv01
  
  Migration summary:
  * Node srv02: 
  * Node srv01: 
  
  
  Step 2) Block one of the Heartbeat communication interfaces:
  
  # iptables -A INPUT -i eth1 -s ! 192.168.10.110 -j DROP
  # iptables -A INPUT -i eth1 -s ! 192.168.10.120 -j DROP
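
  (For reference: to restore the Heartbeat traffic after the test, the same
  rules can be deleted again with iptables -D, e.g.:)

  # iptables -D INPUT -i eth1 -s ! 192.168.10.110 -j DROP
  # iptables -D INPUT -i eth1 -s ! 192.168.10.120 -j DROP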
  
  
  Step 3) The following traps are received by the SNMP manager:
  
  (snip)
  Jun 15 19:24:30 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:30 UNKNOWN
  [UDP: [192.168.40.120]:59010]:
      DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23014) 0:03:50.14
      SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate
      LINUX-HA-MIB::LHANodeName = STRING: srv01
      LINUX-HA-MIB::LHAIFName = STRING: eth1
      LINUX-HA-MIB::LHAIFStatus = INTEGER: down(2)
      -> No problem.

  Jun 15 19:24:32 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:32 UNKNOWN
  [UDP: [192.168.40.110]:44001]:
      DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23597) 0:03:55.97
      SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHANodeStatusUpdate
      LINUX-HA-MIB::LHANodeName = STRING: srv02
      LINUX-HA-MIB::LHANodeStatus = INTEGER: active(3)
      -> This 'active' trap comes at the wrong timing.

  Jun 15 19:24:34 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:34 UNKNOWN
  [UDP: [192.168.40.110]:44001]:
      DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23803) 0:03:58.03
      SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate
      LINUX-HA-MIB::LHANodeName = STRING: srv02
      LINUX-HA-MIB::LHAIFName = STRING: eth1
      LINUX-HA-MIB::LHAIFStatus = INTEGER: down(2)
      -> No problem.
  (snip)
  
  It is strange that the node 'active' trap arrives between the two
  interface-down traps.

  And I think the 'active' trap needs to be sent at an earlier timing.
  
  
  This problem seems to happen in Heartbeat 2.1.4.

  I looked through some of the sources, and I think Heartbeat's client_lib
  has a problem of some kind: the transmitted F_STATUS message seems to be
  handled too late.
  
  
  Best Regards,
  Hideo Yamauchi.
  
 
 
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems