Re: [Linux-HA] Heartbeat 3.0.3 stable version + RHEL 6.1: restart network will make heartbeat not send broadcasts
Hi,

I'm using the Heartbeat 3.0.3 stable version on the RHEL 6.1 x64 platform and have found the following issue: if I restart the network service, heartbeat no longer sends broadcast packets from port 694. That means the node never gets a chance to rejoin the HA cluster unless it is restarted.

Steps used to set up the cluster:

1. Compile heartbeat 3.0.3 from source and install it on two RHEL 6.1 x64 nodes: installer001 and rhel61.
2. Compile pacemaker 1.0.9 from source and install it on both nodes.
3. Configure /etc/ha.d/ha.cf and confirm that both nodes show as Online in crm status.
4. Run tcpdump -i eth0 port 694; both nodes can be seen sending heartbeat broadcast packets.

Configuration file:

  [root@rhel61 ~]# cat /etc/ha.d/ha.cf
  autojoin none
  bcast eth0
  warntime 5
  deadtime 15
  initdead 60
  keepalive 2
  node installer001
  node rhel61
  crm respawn

Then I restarted the network service on the backup node installer001 (or just ran ifdown eth0; ifup eth0). Node rhel61 immediately detected installer001 as offline, and installer001 detected rhel61 as offline. When I ran tcpdump -i eth0 port 694 on installer001 again, I could see rhel61 still sending broadcast packets, but no broadcast packets coming from installer001, even though the eth0 network had fully recovered.

I tried exactly the same case on RHEL 5.6 (heartbeat 3.0.3) and it works well there: after a network restart, the node still sends out broadcast packets.

Thanks for your comments.

--Lei
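One way to narrow this down, not verified in this thread, is to check whether the stopped broadcasts are tied to heartbeat's socket on eth0: with "bcast eth0" heartbeat binds its broadcast socket to the interface at startup, so an ifdown on RHEL 6 may leave it holding a dead socket. A minimal check, reusing the interface and commands from the report above:

  # on installer001, after ifdown eth0; ifup eth0
  tcpdump -c 5 -i eth0 port 694     # confirm no broadcasts leave this node
  service heartbeat restart         # recreate heartbeat's broadcast socket
  tcpdump -c 5 -i eth0 port 694     # broadcasts from installer001 should reappear

If the broadcasts come back only after the heartbeat restart, that points at the stale socket rather than the network itself.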
Re: [Linux-HA] stonith with external/vcenter
On Mon, Jul 18, 2011 at 3:25 PM, lowshoe <lows...@gmail.com> wrote:
> hi guys,
>
> help for this problem is still greatly appreciated! i can give more
> info, log messages or configs if needed.

It might be best to contact SUSE support. They'll be able to give this a
higher priority.

> regards, lowshoe.
>
> lowshoe wrote:
>
> hi guys,
>
> i already use the nice external/vcenter stonith plugin in an Ubuntu
> 10.04.2 LTS-based 2-node cluster, where it works like a charm. now i
> wanted to use it with the same configuration on a SLES 11 SP1-based
> 2-node cluster.
>
> the command-line test directly with stonith succeeds:
>
>   stonith -t external/vcenter VI_SERVER=*.*.*.* \
>     VI_CREDSTORE=/path/to/vicredentials.xml \
>     HOSTLIST="hostname1=vmdb1n1;hostname2=vmdb1n2" RESETPOWERON=0 -lS
>
>   ** INFO: Cannot get parameter VI_PORTNUMBER from StonithNVpair
>   ** INFO: Cannot get parameter VI_PROTOCOL from StonithNVpair
>   ** INFO: Cannot get parameter VI_SERVICEPATH from StonithNVpair
>   stonith: external/vcenter device OK.
>   hostname1
>   hostname2
>
> but when i try to get it working as a pacemaker resource, i get errors
> when trying to start the resource. this is the config:
>
>   crm configure primitive shoot-node1 stonith:external/vcenter \
>     params VI_SERVER=*.*.*.* VI_CREDSTORE=/path/to/vicredentials.xml \
>     HOSTLIST=node1=vm1 RESETPOWERON=0 op monitor interval=60s
>   crm configure primitive shoot-node2 stonith:external/vcenter \
>     params VI_SERVER=*.*.*.* VI_CREDSTORE=/path/to/vicredentials.xml \
>     HOSTLIST=node2=vm2 RESETPOWERON=0 op monitor interval=60s
>   location shoot-node1-placement shoot-node1 \
>     rule $id=shoot-node1-placement-rule -inf: #uname ne node1
>   location shoot-node2-placement shoot-node2 \
>     rule $id=shoot-node2-placement-rule -inf: #uname ne node2
>
> and these are the errors i get. in crm_mon:
>
>   shoot-node1 (stonith:external/vcenter): Started node2
>   Failed actions:
>     shoot-node1_monitor_6 (node=node2, call=40, rc=1, status=complete): unknown error
>
> in /var/log/messages:
>
>   Jul 14 15:47:49 node2 lrmd: [3655]: info: rsc:shoot-node1:27: start
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: stonithRA plugin: got metadata: [..]
>   Jul 14 15:47:51 node2 lrmd: [3655]: WARN: G_SIG_dispatch: Dispatch function for SIGCHLD was delayed 1290 ms (> 100 ms) before being called (GSource: 0x6192c0)
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: G_SIG_dispatch: started at 1718940021 should have started at 1718939892
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:28: monitor
>   Jul 14 15:47:51 node2 stonith: external/vcenter device not accessible.
>   Jul 14 15:47:51 node2 stonith-ng: [3653]: notice: log_operation: Operation 'monitor' [20916] for device 'shoot-node1' returned: 1
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: cancel_op: operation monitor[28] on stonith::external/vcenter::shoot-node1 for client 3658, its parameters: HOSTLIST=[node1=vm1] VI_CREDSTORE=[/path/to/credstore/vicredentials.xml] VI_SERVER=[*.*.*.*] RESETPOWERON=[0] crm_feature_set=[3.0.2] CRM_meta_name=[monitor] CRM_meta_timeout=[2] CRM_meta_interval=[6] cancelled
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:29: stop
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:30: start
>   Jul 14 15:47:51 node2 lrmd: [3655]: info: rsc:shoot-node1:31: monitor
>   Jul 14 15:47:51 node2 stonith: external/vcenter device not accessible.
>
> why does this work on ubuntu but not on sles? on ubuntu i use Corosync
> Cluster Engine version '1.2.0', on sles i use Corosync Cluster Engine
> version '1.2.7'. could the version difference be the reason?
> regards, lowshoe
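Since the failure is in the monitor operation ("device not accessible"), one way to narrow it down, a suggestion rather than a verified fix, is to run the plugin's status check by hand on the SLES node with exactly the parameters from the resource definition:

  # same check the monitor op performs, run manually as root
  stonith -t external/vcenter VI_SERVER=*.*.*.* \
    VI_CREDSTORE=/path/to/vicredentials.xml \
    HOSTLIST=node1=vm1 RESETPOWERON=0 -S

If this succeeds from a root shell but the resource monitor still fails, the difference is more likely in the environment the plugin sees when run under lrmd (for example PATH or the location of the VMware Perl toolkit) than in the corosync version.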
Re: [Linux-HA] split brain problem
On Sat, Jul 16, 2011 at 7:31 PM, Willi Fehler <willi.feh...@t-online.de> wrote:
> Hi,
>
> I've installed a Pacemaker/OpenAIS/Corosync/DRBD/MySQL cluster on
> CentOS 6 (VirtualBox). If I start both nodes at the same time, I
> always get a split-brain situation.

Split brain as in, corosync on the two nodes can't talk to one another?

> If I start one node and wait until it is promoted to DRBD master,
> everything works. How can I tell Pacemaker which node should always
> become master?

a location constraint with role=Master

>   [root@linsrv001 ~]# crm configure show
>   node linsrv001.willi-net.local
>   node linsrv002.willi-net.local
>   primitive drbd_mysql ocf:linbit:drbd \
>       params drbd_resource=r0 \
>       op monitor interval=15s
>   primitive fs_mysql ocf:heartbeat:Filesystem \
>       params device=/dev/drbd/by-res/r0 directory=/var/lib/mysql fstype=xfs
>   primitive ip_mysql ocf:heartbeat:IPaddr2 \
>       params ip=192.168.2.92 nic=eth0
>   primitive mysqld lsb:mysql
>   group mysql fs_mysql ip_mysql mysqld
>   ms ms_drbd_mysql drbd_mysql \
>       meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>   location cli-prefer-mysql mysql \
>       rule $id=cli-prefer-rule-mysql inf: #uname eq linsrv001.willi-net.local
>   colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
>   order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
>   property $id=cib-bootstrap-options \
>       dc-version=1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe \
>       cluster-infrastructure=openais \
>       expected-quorum-votes=2 \
>       no-quorum-policy=ignore \
>       stonith-enabled=false
>
> My second question is: what happens if one node fails and I have to
> set up the whole node again? If I then start OpenAIS/Corosync on it,
> what happens with the CIB? (Will the cluster configuration be
> transferred to the node?)
>
> Regards - Willi
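Following the role=Master hint above, a minimal sketch of such a constraint in crm shell syntax; the constraint name and the score of 100 are illustrative, not taken from the thread:

  # prefer linsrv001 for the DRBD Master role
  crm configure location master-on-linsrv001 ms_drbd_mysql \
      rule $id=master-on-linsrv001-rule $role=Master 100: #uname eq linsrv001.willi-net.local

A finite score such as 100 expresses a preference while still letting linsrv002 be promoted when linsrv001 is down; an inf: score would pin the Master role to linsrv001 exclusively.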
Re: [Linux-HA] The active trap of the SNMP is delayed.
Hi All,

We are still troubled by this problem. Please advise.

* I have redirected this report to this mailing list because it appears to be a Heartbeat problem.

Best Regards,
Hideo Yamauchi.

--- On Fri, 2011/6/17, renayama19661...@ybb.ne.jp wrote:

Hi All,

I registered this problem in Bugzilla:
* http://developerbugs.linux-foundation.org/show_bug.cgi?id=2604

Best Regards,
Hideo Yamauchi.

--- On Wed, 2011/6/15, renayama19661...@ybb.ne.jp wrote:

Hi All,

I found a problem with an SNMP trap (from hbagent): the trap reporting a node as active can be delayed. The problem occurs only sometimes, not every time.

I confirmed it with the following procedure.

Step 1) Start a node.

  Last updated: Wed Jun 15 19:23:39 2011
  Stack: Heartbeat
  Current DC: srv02 (afe72fff-b7b4-4663-b845-872df29c635d) - partition WITHOUT quorum
  Version: 1.0.11-6e010d6b0d49a6b929d17c0114e9d2d934dc8e04
  2 Nodes configured, unknown expected votes
  1 Resources configured.

  Online: [ srv01 srv02 ]

  Resource Group: group-1
      prmDummy1 (ocf::heartbeat:Dummy): Started srv01

  Migration summary:
  * Node srv02:
  * Node srv01:

Step 2) Intercept one interface of the Heartbeat communication.

  # iptables -A INPUT -i eth1 -s ! 192.168.10.110 -j DROP
  # iptables -A INPUT -i eth1 -s ! 192.168.10.120 -j DROP

Step 3) The following traps are received by the SNMP manager.

  (snip)
  Jun 15 19:24:30 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:30 UNKNOWN [UDP: [192.168.40.120]:59010]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23014) 0:03:50.14 SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate LINUX-HA-MIB::LHANodeName = STRING: srv01 LINUX-HA-MIB::LHAIFName = STRING: eth1 LINUX-HA-MIB::LHAIFStatus = INTEGER: down(2)

  -> No problem.

  Jun 15 19:24:32 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:32 UNKNOWN [UDP: [192.168.40.110]:44001]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23597) 0:03:55.97 SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHANodeStatusUpdate LINUX-HA-MIB::LHANodeName = STRING: srv02 LINUX-HA-MIB::LHANodeStatus = INTEGER: active(3)

  -> The active trap is wrong at this timing.

  Jun 15 19:24:34 snmp-manager snmptrapd[4771]: 2011-06-15 19:24:34 UNKNOWN [UDP: [192.168.40.110]:44001]: DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (23803) 0:03:58.03 SNMPv2-MIB::snmpTrapOID.0 = OID: LINUX-HA-MIB::LHAIFStatusUpdate LINUX-HA-MIB::LHANodeName = STRING: srv02 LINUX-HA-MIB::LHAIFName = STRING: eth1 LINUX-HA-MIB::LHAIFStatus = INTEGER: down(2)

  -> No problem.
  (snip)

It is strange that the node-active trap arrives between the two interface-down traps; the active trap should be sent at an earlier timing.

This problem seems to happen in Heartbeat 2.1.4. I looked through some of the sources and think Heartbeat's client_lib has a problem somewhere: the transmitted F_STATUS message appears to be handled too late.

Best Regards,
Hideo Yamauchi.
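For anyone reproducing Step 2, the same rule deleted with -D restores the Heartbeat link, and running snmptrapd in the foreground makes the trap ordering easy to compare. A minimal sketch, reusing an address from the steps above (newer iptables versions want the negation written as ! -s):

  # partition: drop eth1 traffic not coming from 192.168.10.110
  iptables -A INPUT -i eth1 ! -s 192.168.10.110 -j DROP
  # on the SNMP manager, log incoming traps to stdout with timestamps
  snmptrapd -f -Lo
  # heal: delete the identical rule once the traps have been captured
  iptables -D INPUT -i eth1 ! -s 192.168.10.110 -j DROP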