Re: [Linux-HA] pcs configuration issue
Hi Willi, On 2013-12-09 07:58, Willi Fehler wrote: Hi Chris, I've upgraded to CentOS-6.5 with the latest version of pcs but the issue still exists. [root@linsrv006 ~]# pcs constraint location mysql rule score=pingd defined pingd Error: 'mysql' is not a resource Do I need to download the latest pcs code and build my own package? pcs is not mandatory and you can download and use the crm shell for CentOS from: http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/ Regards, Andreas Regards - Willi On 03.12.13 01:54, Chris Feist wrote: On 11/26/2013 03:27 AM, Willi Fehler wrote: Hello, I'm trying to create the following setup in Pacemaker 1.1.10. pcs property set no-quorum-policy=ignore pcs property set stonith-enabled=false pcs resource create drbd_mysql ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s pcs resource master ms_drbd_mysql drbd_mysql master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true pcs resource create fs_mysql Filesystem device=/dev/drbd/by-res/r0 directory=/var/lib/mysql fstype=xfs options=noatime pcs resource create ip_mysql IPaddr2 ip=192.168.0.12 cidr_netmask=32 op monitor interval=20s pcs resource create ping ocf:pacemaker:ping host_list=192.168.0.1 multiplier=100 dampen=10s op monitor interval=60s pcs resource clone ping ping_clone globally-unique=false pcs resource create mysqld ocf:heartbeat:mysql binary=/usr/sbin/mysqld datadir=/var/lib/mysql config=/etc/my.cnf \ pid=/var/run/mysqld/mysqld.pid socket=/var/run/mysqld/mysqld.sock \ op monitor interval=15s timeout=30s op start interval=0 timeout=180s op stop interval=0 timeout=300s pcs resource group add mysql fs_mysql mysqld ip_mysql pcs constraint colocation add mysql ms_drbd_mysql INFINITY with-rsc-role=Master pcs constraint order promote ms_drbd_mysql then start mysql pcs constraint location mysql rule pingd: defined ping There are two issues here: first, there's a bug in pcs which doesn't recognize groups in location constraint rules, and second, the pcs rule syntax is slightly different from crm. You should be able to use this command with the latest upstream: pcs constraint location mysql rule score=pingd defined pingd Thanks, Chris The last line is not working: [root@linsrv006 ~]# pcs constraint location mysql rule pingd: defined pingd Error: 'mysql' is not a resource By the way, can anybody verify the other lines? I'm very new to pcs. Here is my old configuration.
crm configure crm(live)configure#primitive drbd_mysql ocf:linbit:drbd \ params drbd_resource=r0 \ op monitor interval=10s role=Master \ op monitor interval=20s role=Slave \ op start interval=0 timeout=240 \ op stop interval=0 timeout=240 crm(live)configure#ms ms_drbd_mysql drbd_mysql \ meta master-max=1 master-node-max=1 \ clone-max=2 clone-node-max=1 \ notify=true target-role=Master crm(live)configure#primitive fs_mysql ocf:heartbeat:Filesystem \ params device=/dev/drbd/by-res/r0 directory=/var/lib/mysql fstype=xfs options=noatime \ op start interval=0 timeout=180s \ op stop interval=0 timeout=300s \ op monitor interval=60s crm(live)configure#primitive ip_mysql ocf:heartbeat:IPaddr2 \ params ip=192.168.0.92 cidr_netmask=24 \ op monitor interval=20 crm(live)configure#primitive ping_eth0 ocf:pacemaker:ping \ params host_list=192.168.0.1 multiplier=100 \ op monitor interval=10s timeout=20s \ op start interval=0 timeout=90s \ op stop interval=0 timeout=100s crm(live)configure#clone ping_eth0_clone ping_eth0 \ meta globally-unique=false crm(live)configure#primitive mysqld ocf:heartbeat:mysql \ params binary=/usr/sbin/mysqld datadir=/var/lib/mysql config=/etc/my.cnf pid=/var/run/mysqld/mysqld.pid socket=/var/run/mysqld/mysqld.sock \ op monitor interval=15s timeout=30s \ op start interval=0 timeout=180s \ op stop interval=0 timeout=300s \ meta target-role=Started crm(live)configure#group mysql fs_mysql mysqld ip_mysql crm(live)configure#location l_mysql_on_01 mysql 100: linsrv001.willi-net.local crm(live)configure#location mysql-on-connected-node mysql \ rule $id=mysql-on-connected-node-rule -inf: not_defined pingd or pingd lte 0 crm(live)configure#colocation mysql_on_drbd inf: mysql
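For reference, a minimal sketch of the connectivity rule in both syntaxes, assuming the ping clone writes its default attribute name pingd and that the pcs build in use already accepts groups in location rules (untested):

location mysql-on-connected-node mysql \
    rule -inf: not_defined pingd or pingd lte 0

pcs constraint location mysql rule score=-INFINITY not_defined pingd or pingd lte 0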
Re: [Linux-HA] problem with pgsql streaming resource agent
On 2013-07-08 19:40, Jeff Frost wrote: We're testing out the pgsql master slave streaming replication resource agent that's found here: https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql and using the example 2-node configuration found here https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for-PostgreSQL-9.1-streaming-replication as a template, we came up with the following configuration: node node1 node node2 primitive pgsql ocf:heartbeat:pgsql \ params pgctl=/usr/pgsql-9.2/bin/pg_ctl psql=/usr/pgsql-9.2/bin/psql pgdata=/var/lib/pgsql/9.2/data/ start_opt=-p 5432 rep_mode=async node_list=node1 node2 repuser=replicauser restore_command=rsync -aq /var/lib/pgsql/wal_archive/%f %p master_ip=192.168.253.104 stop_escalate=0 \ op start interval=0s role=Master timeout=60s on-fail=block Looks like you are missing the monitor operations ... as described in the example you are referring. In the monitoring operation such master-slave agents recalculate their master-score and refresh e.g. in this RA various node-attributes. And you should follow the described procedures to correctly start-up the cluster. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
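As a rough sketch of the missing pieces, roughly following the referenced wiki example (intervals and timeouts are illustrative and untested; the params line is unchanged from the original post):

primitive pgsql ocf:heartbeat:pgsql \
    params ... \
    op start timeout=60s interval=0s on-fail=restart \
    op monitor timeout=60s interval=4s on-fail=restart \
    op monitor timeout=60s interval=3s on-fail=restart role=Master \
    op promote timeout=60s interval=0s on-fail=restart \
    op demote timeout=60s interval=0s on-fail=stop \
    op stop timeout=60s interval=0s on-fail=block \
    op notify timeout=60s interval=0s
ms msPostgresql pgsql \
    meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true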
Re: [Linux-HA] Resource Always Tries to Start on the Wrong Node
Hello Eric, On 2013-06-27 17:35, Robinson, Eric wrote: -Original Message- I don't understand why resources try to start on the wrong node (and of course fail). Pacemaker 1.0.7 ... looking at the Changelog of Pacemaker 1.0 at https://github.com/ClusterLabs/pacemaker-1.0/blob/master/ChangeLog there are quite some colocation fixes also for sets after that version. I'd say you hit an already fixed bug. Regards, Andreas My nodes are ha05 and ha06. ha05 is master/primary and all resources are running on it. If I run... crm resource stop p_MySQL_185 ..the resource stops fine. Then if I run... crm resource start p_MySQL_185 ..the resource fails to start, and crm_mon shows that it tried to start on the wrong node (ha06). Then if I run... crm resource cleanup p_MySQL_185 ..the resource cleans up and starts on node ha05. Here is my crm config... node ha05.mycharts.md \ attributes standby=off node ha06.mycharts.md primitive p_ClusterIP ocf:heartbeat:IPaddr2 \ params ip=192.168.10.205 cidr_netmask=32 \ op monitor interval=15s \ meta target-role=Started primitive p_DRBD ocf:linbit:drbd \ params drbd_resource=ha_mysql \ op monitor interval=15s primitive p_FileSystem ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/ha_mysql fstype=ext3 options=noatime primitive p_MySQL_000 lsb:mysql_000 primitive p_MySQL_001 lsb:mysql_001 primitive p_MySQL_054 lsb:mysql_054 primitive p_MySQL_057 lsb:mysql_057 primitive p_MySQL_103 lsb:mysql_103 primitive p_MySQL_106 lsb:mysql_106 primitive p_MySQL_139 lsb:mysql_139 primitive p_MySQL_140 lsb:mysql_140 primitive p_MySQL_141 lsb:mysql_141 primitive p_MySQL_142 lsb:mysql_142 primitive p_MySQL_143 lsb:mysql_143 primitive p_MySQL_144 lsb:mysql_144 primitive p_MySQL_145 lsb:mysql_145 primitive p_MySQL_146 lsb:mysql_146 primitive p_MySQL_147 lsb:mysql_147 primitive p_MySQL_148 lsb:mysql_148 primitive p_MySQL_149 lsb:mysql_149 primitive p_MySQL_150 lsb:mysql_150 primitive p_MySQL_151 lsb:mysql_151 primitive p_MySQL_152 lsb:mysql_152 primitive p_MySQL_153 lsb:mysql_153 primitive p_MySQL_154 lsb:mysql_154 primitive p_MySQL_157 lsb:mysql_157 primitive p_MySQL_158 lsb:mysql_158 primitive p_MySQL_160 lsb:mysql_160 primitive p_MySQL_161 lsb:mysql_161 primitive p_MySQL_162 lsb:mysql_162 primitive p_MySQL_163 lsb:mysql_163 primitive p_MySQL_164 lsb:mysql_164 primitive p_MySQL_165 lsb:mysql_165 primitive p_MySQL_167 lsb:mysql_167 primitive p_MySQL_168 lsb:mysql_168 primitive p_MySQL_169 lsb:mysql_169 primitive p_MySQL_170 lsb:mysql_170 primitive p_MySQL_171 lsb:mysql_171 primitive p_MySQL_172 lsb:mysql_172 primitive p_MySQL_173 lsb:mysql_173 primitive p_MySQL_174 lsb:mysql_174 primitive p_MySQL_175 lsb:mysql_175 primitive p_MySQL_176 lsb:mysql_176 primitive p_MySQL_177 lsb:mysql_177 primitive p_MySQL_178 lsb:mysql_178 primitive p_MySQL_179 lsb:mysql_179 primitive p_MySQL_180 lsb:mysql_180 primitive p_MySQL_181 lsb:mysql_181 primitive p_MySQL_182 lsb:mysql_182 primitive p_MySQL_183 lsb:mysql_183 primitive p_MySQL_184 lsb:mysql_184 primitive p_MySQL_185 lsb:mysql_185 \ meta target-role=Started primitive p_MySQL_186 lsb:mysql_186 primitive p_MySQL_187 lsb:mysql_187 primitive p_MySQL_188 lsb:mysql_188 primitive p_MySQL_189 lsb:mysql_189 \ meta target-role=Started primitive p_MySQL_191 lsb:mysql_191 primitive p_MySQL_192 lsb:mysql_192 primitive p_MySQL_194 lsb:mysql_194 primitive p_MySQL_195 lsb:mysql_195 primitive p_MySQL_196 lsb:mysql_196 primitive p_MySQL_197 lsb:mysql_197 primitive p_MySQL_198 lsb:mysql_198 primitive p_MySQL_199 lsb:mysql_199 primitive p_MySQL_200 
lsb:mysql_200 primitive p_MySQL_201 lsb:mysql_201 primitive p_MySQL_202 lsb:mysql_202 primitive p_MySQL_203 lsb:mysql_203 primitive p_MySQL_204 lsb:mysql_204 primitive p_MySQL_205 lsb:mysql_205 primitive p_MySQL_206 lsb:mysql_206 primitive p_MySQL_207 lsb:mysql_207 primitive p_MySQL_208 lsb:mysql_208 primitive p_MySQL_209 lsb:mysql_209 primitive p_MySQL_210 lsb:mysql_210 primitive p_MySQL_211 lsb:mysql_211 primitive p_MySQL_212 lsb:mysql_212 primitive p_MySQL_213 lsb:mysql_213 primitive p_MySQL_214 lsb:mysql_214 primitive p_MySQL_215 lsb:mysql_215 primitive p_MySQL_216 lsb:mysql_216 primitive p_MySQL_217 lsb:mysql_217 primitive p_MySQL_218 lsb:mysql_218 primitive p_MySQL_219 lsb:mysql_219 primitive p_MySQL_220 lsb:mysql_220 primitive p_MySQL_221 lsb:mysql_221 primitive p_MySQL_222 lsb:mysql_222 primitive p_MySQL_224 lsb:mysql_224 primitive p_MySQL_225 lsb:mysql_225 primitive p_MySQL_226 lsb:mysql_226 primitive p_MySQL_227 lsb:mysql_227 primitive p_MySQL_228 lsb:mysql_228 primitive p_MySQL_229 lsb:mysql_229 ms ms_DRBD p_DRBD \ meta master-max=1 master-node-max=1 clone-max=2 clone- node-max=1 notify=true colocation c_virtdb03 inf: ( p_MySQL_001 p_MySQL_139 p_MySQL_140 p_MySQL_141 p_MySQL_142 p_MySQL_143
Re: [Linux-HA] Fwd: stonith with sbd not working
On 2013-04-10 17:47, Fredrik Hudner wrote: Hi Lars, I wouldn't mind to try to install one of the tar balls from http://hg.linux-ha.org/sbd, only I'm not sure how to do it after I've unzipped/tar it. I saw someone from a discussion group that wanted to do it as well.. If you only tell me how to get it installed (will upgrade pacemaker to 1.1.8) sbd-1837fd8cc64a that would be great. I noticed after after I checked a bit closer that all those sbd-common.c, sbd.h etc were missing.. (so yeah it was a packaging issue as such). And if it won't work.. what options do I have in a VMware environment ? any suggestions ESX? ... on RHEL/Centos the suggestion would be fence_vmware_soap Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now krgds /Fredrik On Wed, Apr 10, 2013 at 5:34 PM, Lars Marowsky-Bree l...@suse.com wrote: On 2013-04-10T15:25:38, Fredrik Hudner fredrik.hud...@gmail.com wrote: and removed watchdog from the system but without success.. Still can't see any references that sbd has started in the messages log It looks as if the init script of pacemaker (openais/corosync on SUSE) is not taking care to start the sbd daemon. I don't think RHT will support sbd on RHEL anyway. I don't think that has been tested, sorry :-( This looks like a packaging issue. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
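A rough sketch of the fence_vmware_soap approach for an ESX/vCenter-backed pair; the address, credentials and VM names below are placeholders and the snippet is untested:

primitive st-vmware stonith:fence_vmware_soap \
    params ipaddr=vcenter.example.com login=fenceuser passwd=secret ssl=1 \
        pcmk_host_map="node1:node1-vm;node2:node2-vm" \
    op monitor interval=600s timeout=120s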
Re: [Linux-HA] Antw: Re: manage/umanage
On 2013-03-27 08:30, Moullé Alain wrote: Hi Thanks but I never asked to run monitoring on an unmanaged resource ... ? ! I ask for the opposite : a way to set one resource in a state near to umanage, meaning umanaged and wo monitoring, and wo to be forced to set all the cluster-management umanaged with maitenance-mode=true. I think that, with regards to the responses, this function does not exist ... You can set a resource to unmanaged and additionally add enabled=false to the monitor definition. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Alain Le 27/03/2013 08:24, Ulrich Windl a écrit : Hi! I see little sense to run monitoring on an unmanaged resource, specifically as some _monitoring_ operations are not strict read-only, but do change the state of a resource (which may be quite unexpected). One example is the RAID RA, which tries to re-add missing devices. Regards, Ulrich Moullé Alainalain.mou...@bull.net schrieb am 27.03.2013 um 07:56 in Nachricht 51529820.7050...@bull.net: Hi OK thanks, but sorry it was not quite the response I was expected as I already know all that about cleanup, reprobe, etc. So more clearly my question was : Is there a way by crm to invalidate the monitoring temorarily for one specific resource ? Thanks Alain Hi, On Mon, 25 Mar 2013 16:25:54 +0100 Moullé Alain alain.mou...@bull.net wrote: I've tested two things : 1/ if we set maintenance-mode=true : all the configured ressources become 'unmanaged' , as displayed with crm_mon ok start stop are no more accepted and it seems that ressources are no more monitored any more by pacemaker Probably maintainance-mode also tells the cluster-manager to completely stop monitoring. 2/ if we target only one resource via the crm resource umanage resname : it is also displayed unmanage with crm_mon ok start stop are no more accepted BUT pacemaker always monitors the resource Is there a reason for this difference ? Its un-managed, not un-monitored ;-) Actually this is not a problem, it will monitor as long as the service is up. As the first monitor-action fails, the resource is marked as failed and no more monitor action is run. Until you explicitely ask for it with cleanup resource or reprobe node. Have fun, Arnold ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
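A minimal sketch of that combination for a hypothetical resource named myres (untested):

crm resource unmanage myres
# then edit the resource so its monitor operation reads, for example:
#   op monitor interval=30s enabled=false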
Re: [Linux-HA] Many Resources Dependent on One Resource Group
On 2013-03-24 17:58, Robinson, Eric wrote: In the simplest terms, we currently have resources: A = drbd, B = filesystem, C = cluster IP, D thru J = mysql instances. Resource group G1 consists of resources B through J, in that order, and is dependent on resource A. This fails over fine, but it has the serious disadvantage that if you stop or remove a mysql resource in the middle of the list, all of the ones after it stop too. For example, if you stop G, then H thru J stop as well. We want to change it so that the resource group G1 consists only of resources B and C. All of the mysql instances (D thru J) are individually dependent on group G1, but not dependent on each other. That way you can stop or remove a mysql resource without affecting the others. I saw this scenario described in the Pacemaker docs, but I cannot find an example of the syntax. You can use two resource sets and go without groups, with this crm shell syntax: order o_drbd-filesystem-ip-dbs inf: A:promote B C ( D E F G H I J ) colocation co_all-follow-drbd inf: ( D E F G H I J ) B C A:Master Regards, Andreas -- Eric Robinson -- Need help with Pacemaker? http://www.hastexo.com/now ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Fwd: Problem promoting slave to master
On 2013-03-20 13:30, Fredrik Hudner wrote: I presume you are correct about that. (see drbdadm-dump.txt) fence-peer /usr/lib/drbd/crm-fence-peer.sh; after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh; after-resync-target /usr/lib/drbd/crm-unfence-peer.sh; ... to remove the constraint, once secondary is in sync again after a resync run. Regards, Andreas What would I need to do to overwrite it ? Or if you have a nicer way to do it.. It's not easy to take over someones else configuration always Kind regards /Fredrik On Tue, Mar 19, 2013 at 11:32 PM, Andreas Kurz andr...@hastexo.com wrote: On 2013-03-19 16:02, Fredrik Hudner wrote: Just wanted to change what document it*s been built from.. It should be LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6 There is again that fencing-constraint in your configuration what does drbdadm dump all look like? Any chance you only specified a fence-peer handler in you resource configuration but don't overwrite that after-resync-target handler you specified in your global_common.conf ... that would explain that dangling constraint that will prevent a failover. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now -- Forwarded message -- From: Fredrik Hudner fredrik.hud...@gmail.com Date: Mon, Mar 18, 2013 at 11:06 AM Subject: Re: [Linux-HA] Problem promoting slave to master To: General Linux-HA mailing list linux-ha@lists.linux-ha.org On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz andr...@hastexo.com wrote: On 2013-03-14 15:52, Fredrik Hudner wrote: I set no-quorum-policy to ignore and removed the constraint you mentioned. It then managed to failover once to the slave node, but I still have those. Failed actions: p_exportfs_root:0_monitor_ 3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running This only tells you that monitoring of these resources found them once not running logs should tell you what when that happens I have attached the logs from master and slave.. I can see that it stops, but not really why (to limited knowledge to read the logs) I then stoped the new maste-node to see if it fell over to the other node with no success.. It remains slave. Hard to say without seeing current cluster state like a crm_mon -1frA, cat /proc/drbd and some logs ... not enough information ... I have attached the output from crm_mon, the new crm configure and /proc/drbd I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs was back in the crm configure. Seems like cib makes a replace This constraint is added by the DRBD primary if it looses connection to its peer and is perfectly fine if you stopped one node. Seems like the cluster have a problem attaching to the cluster node ip, but I'm not sure why i would like to add, that I took over this configuration from a guy that has left, but I know that it's configured by using the technical documentation from LINBIT Highly available NFS storage with DRBD and Pacemaker. 
Mar 14 15:06:18 [1786] tdtestclu02 crmd: info: abort_transition_graph:te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) : Non-status change Mar 14 15:06:18 [1786] tdtestclu02 crmd: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Mar 14 15:06:18 [1781] tdtestclu02cib: info: cib_replace_notify:Replaced: 0.780.39 - 0.781.1 from tdtestclu01 So not sure how to remove that constraint on a permanent basis.. it's gone as long as I don't stop pacemaker. Once the DRBD resync is finished it will be removed from the cluster configuration again automatically... you typically never need to remove such drbd-fence-constraints manually only in some rare failure scenarios. Regards, Andreas But it used to work booth with the no-quorom-policy=freeze and that constraint Kind regards /Fredrik On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz andr...@hastexo.com wrote: On 2013-03-14 13:30, Fredrik Hudner wrote: Hi all, I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6) Then I wanted to add a third node acting as quorum node, but was not able to get it to work (probably because I don’t understand how to set it up). So I removed the 3rd node, but had to use the force command as crm complained when I tried to remove it. Now when I start up Pacemaker the resources doesn’t look like they come up correctly Online: [ testclu01 testclu02 ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ testclu01
Re: [Linux-HA] Fwd: Problem promoting slave to master
On 2013-03-19 16:02, Fredrik Hudner wrote: Just wanted to change what document it*s been built from.. It should be LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6 There is again that fencing-constraint in your configuration what does drbdadm dump all look like? Any chance you only specified a fence-peer handler in you resource configuration but don't overwrite that after-resync-target handler you specified in your global_common.conf ... that would explain that dangling constraint that will prevent a failover. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now -- Forwarded message -- From: Fredrik Hudner fredrik.hud...@gmail.com Date: Mon, Mar 18, 2013 at 11:06 AM Subject: Re: [Linux-HA] Problem promoting slave to master To: General Linux-HA mailing list linux-ha@lists.linux-ha.org On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz andr...@hastexo.com wrote: On 2013-03-14 15:52, Fredrik Hudner wrote: I set no-quorum-policy to ignore and removed the constraint you mentioned. It then managed to failover once to the slave node, but I still have those. Failed actions: p_exportfs_root:0_monitor_ 3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running This only tells you that monitoring of these resources found them once not running logs should tell you what when that happens I have attached the logs from master and slave.. I can see that it stops, but not really why (to limited knowledge to read the logs) I then stoped the new maste-node to see if it fell over to the other node with no success.. It remains slave. Hard to say without seeing current cluster state like a crm_mon -1frA, cat /proc/drbd and some logs ... not enough information ... I have attached the output from crm_mon, the new crm configure and /proc/drbd I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs was back in the crm configure. Seems like cib makes a replace This constraint is added by the DRBD primary if it looses connection to its peer and is perfectly fine if you stopped one node. Seems like the cluster have a problem attaching to the cluster node ip, but I'm not sure why i would like to add, that I took over this configuration from a guy that has left, but I know that it's configured by using the technical documentation from LINBIT Highly available NFS storage with DRBD and Pacemaker. Mar 14 15:06:18 [1786] tdtestclu02 crmd: info: abort_transition_graph:te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) : Non-status change Mar 14 15:06:18 [1786] tdtestclu02 crmd: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Mar 14 15:06:18 [1781] tdtestclu02cib: info: cib_replace_notify:Replaced: 0.780.39 - 0.781.1 from tdtestclu01 So not sure how to remove that constraint on a permanent basis.. it's gone as long as I don't stop pacemaker. Once the DRBD resync is finished it will be removed from the cluster configuration again automatically... you typically never need to remove such drbd-fence-constraints manually only in some rare failure scenarios. 
Regards, Andreas But it used to work booth with the no-quorom-policy=freeze and that constraint Kind regards /Fredrik On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz andr...@hastexo.com wrote: On 2013-03-14 13:30, Fredrik Hudner wrote: Hi all, I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6) Then I wanted to add a third node acting as quorum node, but was not able to get it to work (probably because I don’t understand how to set it up). So I removed the 3rd node, but had to use the force command as crm complained when I tried to remove it. Now when I start up Pacemaker the resources doesn’t look like they come up correctly Online: [ testclu01 testclu02 ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ testclu01 ] Slaves: [ testclu02 ] Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver] Started: [ tdtestclu01 tdtestclu02 ] Resource Group: g_nfs p_lvm_nfs (ocf::heartbeat:LVM): Started testclu01 p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01 p_fs_shared2 (ocf::heartbeat:Filesystem):Started testclu01 p_ip_nfs (ocf::heartbeat:IPaddr2): Started testclu01 Clone Set: cl_exportfs_root [p_exportfs_root] Started: [ testclu01 testclu02 ] Failed actions: p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, status=complete): not running
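For reference, a sketch of the handler wiring discussed above, as it would appear in the DRBD resource (or common) configuration; the paths are the stock scripts shipped with DRBD and the snippet is untested:

disk {
    fencing resource-only;
}
handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}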
Re: [Linux-HA] Need some help for corosync/pacemaker with slapd master/slave
On 2013-03-18 16:53, guilla...@cheramy.name wrote: Hello, I'll try for a week to make a cluster with 2 servers for slapd HA. ldap01 is the master slapd server, ldap02 is a replica server. All is ok with Debian and /etc/ldap/slapd.conf configuration. So now I wants if ldap01 fail that ldap02 became master for continue to add data in ldap database. And if ldap01 failback I wants it became slave of ldap02 for resync data. So I'll try this configuration with pacemaker multi-state with this configuration : # crm configure show node ldap01 \ attributes standby=off node ldap02 \ attributes standby=off primitive LDAP-SERVER ocf:custom:slapd-ms \ op monitor interval=30s primitive VIP-1 ocf:heartbeat:IPaddr2 \ params ip=10.0.2.30 broadcast=10.0.2.255 nic=eth0 cidr_netmask=24 iflabel=VIP1 \ op monitor interval=30s timeout=20s ms MS-LDAP-SERVER LDAP-SERVER \ params config=/etc/ldap/slapd.conf lsb_script=/etc/init.d/slapd \ meta clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=false target-role=Master colocation LDAP-WITH-IP inf: VIP-1 MS-LDAP-SERVER order LDAP-AFTER-IP inf: VIP-1 MS-LDAP-SERVER property $id=cib-bootstrap-options \ dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1363619724 I use a ocf slapd-ms script provided by this article : http://foaa.de/old-blog/2010/10/intro-to-pacemaker-part-2-advanced-topics/trackback/index.html#master-slave-primus-inter-pares interesting setup and resource agent you should really go with the well tested slapd resource agent that comes with the resource-agent package, use slapd as a clone resource and setup slapd to run in mirror-mode. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now So my problem is that all the deux nodes start as Slaves for MS-LDAP-SERVER : Last updated: Mon Mar 18 16:53:05 2013 Last change: Mon Mar 18 16:53:01 2013 via crmd on ldap02 Stack: openais Current DC: ldap01 - partition with quorum Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff 2 Nodes configured, 2 expected votes 3 Resources configured. Online: [ ldap01 ldap02 ] VIP-1 (ocf::heartbeat:IPaddr2): Started ldap01 Master/Slave Set: MS-LDAP-SERVER [LDAP-SERVER] Slaves: [ ldap01 ldap02 ] Could somone helps me ? Thanks ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
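A rough sketch of the clone-based variant Andreas describes, reusing the resource names from the post; slapd itself still has to be configured for mirror mode, and the snippet is untested:

primitive LDAP-SERVER ocf:heartbeat:slapd \
    params config=/etc/ldap/slapd.conf \
    op monitor interval=30s
clone CL-LDAP-SERVER LDAP-SERVER \
    meta clone-max=2 clone-node-max=1
colocation IP-WITH-LDAP inf: VIP-1 CL-LDAP-SERVER
order LDAP-BEFORE-IP inf: CL-LDAP-SERVER VIP-1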
Re: [Linux-HA] Problem promoting slave to master
On 2013-03-14 13:30, Fredrik Hudner wrote: Hi all, I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6) Then I wanted to add a third node acting as quorum node, but was not able to get it to work (probably because I don’t understand how to set it up). So I removed the 3rd node, but had to use the force command as crm complained when I tried to remove it. Now when I start up Pacemaker the resources doesn’t look like they come up correctly Online: [ testclu01 testclu02 ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ testclu01 ] Slaves: [ testclu02 ] Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver] Started: [ tdtestclu01 tdtestclu02 ] Resource Group: g_nfs p_lvm_nfs (ocf::heartbeat:LVM): Started testclu01 p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01 p_fs_shared2 (ocf::heartbeat:Filesystem):Started testclu01 p_ip_nfs (ocf::heartbeat:IPaddr2): Started testclu01 Clone Set: cl_exportfs_root [p_exportfs_root] Started: [ testclu01 testclu02 ] Failed actions: p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running The filesystems mount correctly on the master at this stage and can be written to. When I stop the services on the master node for it to failover, it doesn’t work.. Looses cluster-ip connectivity fix your no-quorum-policy, you want to ignore the quorum in a two-node cluster to allow failover ... and if your drbd device is already in sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs constraint. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Corosync.log from master after I stopped pacemaker on master node : see attached file Additional files (attached): crm-configure show Corosync.conf Global_common.conf I’m not sure how to proceed to get it up in a fair state now So if anyone could help me it would be much appreciated Kind regards /Fredrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
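A sketch of those two steps in the crm shell (constraint id taken from the posted configuration; only delete it once /proc/drbd shows both sides UpToDate):

crm configure property no-quorum-policy=ignore
crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs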
Re: [Linux-HA] Problem promoting slave to master
On 2013-03-14 15:52, Fredrik Hudner wrote: I set no-quorum-policy to ignore and removed the constraint you mentioned. It then managed to failover once to the slave node, but I still have those. Failed actions: p_exportfs_root:0_monitor_ 3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running This only tells you that monitoring of these resources found them once not running logs should tell you what when that happens I then stoped the new maste-node to see if it fell over to the other node with no success.. It remains slave. Hard to say without seeing current cluster state like a crm_mon -1frA, cat /proc/drbd and some logs ... not enough information ... I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs was back in the crm configure. Seems like cib makes a replace This constraint is added by the DRBD primary if it looses connection to its peer and is perfectly fine if you stopped one node. Mar 14 15:06:18 [1786] tdtestclu02 crmd: info: abort_transition_graph:te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) : Non-status change Mar 14 15:06:18 [1786] tdtestclu02 crmd: notice: do_state_transition: State transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Mar 14 15:06:18 [1781] tdtestclu02cib: info: cib_replace_notify:Replaced: 0.780.39 - 0.781.1 from tdtestclu01 So not sure how to remove that constraint on a permanent basis.. it's gone as long as I don't stop pacemaker. Once the DRBD resync is finished it will be removed from the cluster configuration again automatically... you typically never need to remove such drbd-fence-constraints manually only in some rare failure scenarios. Regards, Andreas But it used to work booth with the no-quorom-policy=freeze and that constraint Kind regards /Fredrik On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz andr...@hastexo.com wrote: On 2013-03-14 13:30, Fredrik Hudner wrote: Hi all, I have a problem after I removed a node with the force command from my crm config. Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6, pacemaker 1.1.7-6.el6) Then I wanted to add a third node acting as quorum node, but was not able to get it to work (probably because I don’t understand how to set it up). So I removed the 3rd node, but had to use the force command as crm complained when I tried to remove it. Now when I start up Pacemaker the resources doesn’t look like they come up correctly Online: [ testclu01 testclu02 ] Master/Slave Set: ms_drbd_nfs [p_drbd_nfs] Masters: [ testclu01 ] Slaves: [ testclu02 ] Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver] Started: [ tdtestclu01 tdtestclu02 ] Resource Group: g_nfs p_lvm_nfs (ocf::heartbeat:LVM): Started testclu01 p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01 p_fs_shared2 (ocf::heartbeat:Filesystem):Started testclu01 p_ip_nfs (ocf::heartbeat:IPaddr2): Started testclu01 Clone Set: cl_exportfs_root [p_exportfs_root] Started: [ testclu01 testclu02 ] Failed actions: p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7, status=complete): not running p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7, status=complete): not running The filesystems mount correctly on the master at this stage and can be written to. When I stop the services on the master node for it to failover, it doesn’t work.. 
Loses cluster-ip connectivity. Fix your no-quorum-policy; you want to ignore quorum in a two-node cluster to allow failover ... and if your DRBD device is already in sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs constraint. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Corosync.log from master after I stopped pacemaker on master node: see attached file. Additional files (attached): crm-configure show, Corosync.conf, Global_common.conf. I'm not sure how to proceed to get it into a fair state now, so if anyone could help me it would be much appreciated. Kind regards /Fredrik ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] action monitor not advertised in meta-data
On 2013-02-06 15:32, Mario Linick wrote: Hi everyone, I have problems with a warning message when configuring a DRBD RA. clusterinfo: nodes: 2 os: sles11sp2 + sleha drbd-version: 8.41 pacemaker: 1.1.7 corosync: 1.4.3 I'm stuck trying to add the DRBD resources. Specifically, whenever I try to configure my DRBD resources I get the following. The input is: crm configure primitive drbd_r0 ocf:linbit:drbd params drbd_resource=r0 op monitor interval=15 The output is: WARNING: drbd_r0: action monitor not advertised in meta-data, it may not be supported by the RA My question: what did I do wrong (I don't understand the consequences of this warning)? The DRBD RA only advertises monitor actions for the Master and Slave roles. Try: crm configure primitive drbd_r0 ocf:linbit:drbd \ params drbd_resource=r0 \ op monitor interval=30s role=Slave \ op monitor interval=29s role=Master Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now can anyone help? Thanks in advance, Mario ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Node UNCLEAN (online)
On 12/13/2012 10:00 AM, Josh Bowling wrote: I have a two node cluster running Ubuntu 12.04 and had everything working just fine (STONITH, failover, etc.) until I changed up the virtual machines and the crm configuration to match (my virtual machines get booted by pacemaker on the survivor node when a failover occurs). The primary node currently has a status of UNCLEAN (online) as it tried to boot a VM that no longer existed - had changed the VMs but not the crm configuration at this point. I have since modified the configuration and synced data with DRBD so everything is good to go except for pacemaker. So the second node was offline? ... as it did not fence the primary node? Is there a way to remove the error and set the UNCLEAN node to just online? I think since it's currently seen as unclean, the new configuration won't propagate to the secondary node. In order to make ensure high availability, both machines need to be clean, online, and have the same crm configuration. Assuming the second node does not run pacemaker switch the cluster into maintenance-mode, adjust your crm configuration to reflect your changes to the VMs and restart pacemaker. Once you made sure all is in place and all vm configurations are successfully probed disable maintenance-mode and start pacemaker on the second node. I'm hoping there's a quick way to get this cluster back on track. I doubt one of my servers are going to fail any time soon, but you never know. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks in advance, Josh ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
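A sketch of that sequence on the node that is still running pacemaker (service handling is illustrative and depends on how pacemaker is started on Ubuntu 12.04, untested):

crm configure property maintenance-mode=true
# adjust the VM resource definitions to match the new virtual machines, e.g. via: crm configure edit
/etc/init.d/pacemaker restart     # or restart corosync if pacemaker runs as a corosync plugin
crm_mon -1                        # wait until all VM resources have been probed cleanly
crm configure property maintenance-mode=false
# finally start pacemaker on the second node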
Re: [Linux-HA] how to defined a service start on a node
On 12/04/2012 08:56 PM, Emmanuel Saint-Joanis wrote: This setup might do the trick : primitive srv-mysql lsb:mysql \ op monitor interval=120 \ op start interval=0 timeout=60 on-fail=restart \ op stop interval=0 timeout=60s on-fail=ignore primitive srv-websphere lsb:websphere \ op monitor interval=120 \ op start interval=0 timeout=60 on-fail=restart \ op stop interval=0 timeout=60s on-fail=ignore ms ms-drbd-data drbd-data \ meta master-max=1 master-node-max=1 clone-max=2 notify=true target-role=Master colocation mysql-only-slave -inf: srv-mysql ms-drbd-data:Master with a score of -inf this would prevent to run srv-mysql on the same node forever ... even in case of a node failure ... using a negative but not -inf score should also allow them to run together in case of node failures. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now colocation websphere-only-master inf: srv-websphere ms-drbd-data:Master The keypoint is to ensure each service runs on different nodes After that you must configure the orders. 2012/12/4 alonerhu alone...@gmail.com I have two machines with corosync+pacemaker, and I want to run mysql + websphere, how can I defined mysql start on node1 and websphere start on node2? thanks. I use drbd for data sync. alonerhu via foxmail ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
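A sketch of the softer variant with an arbitrary finite negative score instead of -inf (the score is illustrative, untested):

colocation mysql-prefers-slave -1000: srv-mysql ms-drbd-data:Master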
Re: [Linux-HA] master/slave drbd resource STILL will not failover
On 12/05/2012 09:31 PM, Robinson, Eric wrote: you could probably find the stop action in the RA and replace it with (e.g.) logger 'AIE ***I did not want this***' and then see what gets logged. -- Well, that worked, in the sense that the resource now fails over. I replaced the start and stop actions in the RA with logger commands. Now when I do 'crm node standby' on the primary, I get the following in the messages log (since there are two drbd resources): Dec 5 12:23:49 ha09b root: STOP action disabled Dec 5 12:23:51 ha09b root: STOP action disabled Dec 5 12:24:22 ha09b root: START action disabled Dec 5 12:24:25 ha09b root: START action disabled The resource then fails over, though it takes maybe 30 seconds to complete. I confirmed that /proc/drbd now shows the correct status on both nodes. I was able to repeat this back and forth a few times just to be sure. When one node is offline, crm_mon shows that the resources are stopped (which they actaully are NOT). Sigh. The RA is clearly not working right, but I don't know if that is the root cause of the failover problems or just a symptom of it. Now what? I'd go with DRBD 8.3.14, there are precompiled RPMs available e.g. from elrepo (testing) ... and I found this thread regarding DRBD Pacemaker 1.1.8 crm-fence-peer.sh is not working correctly without some modifications: http://www.gossamer-threads.com/lists/drbd/users/24550#24550 hth Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now --Eric Disclaimer - December 5, 2012 This email and any files transmitted with it are confidential and intended solely for General Linux-HA mailing list. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physicians' Managed Care or Physician Select Management. Warning: Although Physicians' Managed Care or Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. This disclaimer was added by Policy Patrol: http://www.policypatrol.com/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How to configure resources on pacemaker?
On 12/05/2012 10:49 PM, Felipe Gutierrez wrote: Hi, I configured wrong my pacemaker. I have resources that are wrong. So I need to delete them and configure again. Does any one know how to remove resources and how to configure them correctly? to clean your complete configuration and start-over with an empty one, you can use: cibadmin --force --erase Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now I tryed to remove but the system acuse they are running, even if I stop them # crm resource stop vm8-DRBD # crm configure crm(live)configure# delete vm8-DRBD ERROR: resource vm8-DRBD is running, can't delete it # crm_mon --one-shot -V crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_rsc_op: Hard error - Cluster-FS-Mount_last_failure_0 failed with rc=6: Preventing Cluster-FS-Mount from re-starting anywhere in the cluster Last updated: Wed Dec 5 19:42:56 2012 Last change: Wed Dec 5 18:44:16 2012 via crmd on cloud8 Stack: Heartbeat Current DC: cloud8 (949237ab-9f7d-47d1-b4ad-39e4583d8f0d) - partition with quorum Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 2 Nodes configured, unknown expected votes 3 Resources configured. Node cloud8 (949237ab-9f7d-47d1-b4ad-39e4583d8f0d): UNCLEAN (online) Node cloud10 (6f2a6b44-00c1-4ef2-b936-534a21f3dc45): UNCLEAN (online) Failed actions: vm8_start_0 (node=cloud8, call=14, rc=5, status=complete): not installed vm8-DRBD:1_stop_0 (node=cloud8, call=21, rc=5, status=complete): not installed Cluster-FS-Mount_stop_0 (node=cloud8, call=11, rc=6, status=complete): not configured vm8_start_0 (node=cloud10, call=8, rc=1, status=complete): unknown error vm8-DRBD:0_stop_0 (node=cloud10, call=18, rc=5, status=complete): not installed Thanks in advance, Felipe signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
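If only individual resources have to go once the nodes are clean again, a gentler sketch (the delete only succeeds after the cluster agrees the resource is stopped, untested):

crm resource stop vm8-DRBD
crm_mon -1                     # wait until it is reported as Stopped
crm resource cleanup vm8-DRBD
crm configure delete vm8-DRBD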
Re: [Linux-HA] Pacemaker symmetric-cluster false problem.
On 11/12/2012 05:49 PM, Rafał Radecki wrote: Hi all. I have a cluster of 4 nodes: - lb1.local, lb2.local: pound, varnish, memcache, nginx; - storage1, storage2: no primitives/resources yet. I want to set up an opt-in cluster http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch06s02s02.html. My config is: node lb1.local node lb2.local node storage1 node storage2 primitive LBMemcached lsb:memcached \ meta target-role=Started primitive LBPound lsb:pound primitive LBVIP ocf:heartbeat:IPaddr2 \ params ip=10.0.2.200 cidr_netmask=32 \ op monitor interval=30s \ meta priority=100 target-role=Started is-managed=true primitive LBVarnish lsb:varnish \ meta target-role=Started primitive MCVIP ocf:heartbeat:IPaddr2 \ params ip=192.168.100.200 cidr_netmask=32 \ op monitor interval=30s \ meta target-role=Started location LBMemcached_not_storage1 LBMemcached -inf: storage1 location LBMemcached_not_storage2 LBMemcached -inf: storage2 location LBMemcached_prefer_lb1 LBMemcached 100: lb1.local location LBMemcached_prefer_lb2 LBMemcached 200: lb2.local location LBPound_not_storage1 LBPound -inf: storage1 location LBPound_not_storage2 LBPound -inf: storage2 location LBPound_prefer_lb1 LBPound 200: lb1.local location LBPound_prefer_lb2 LBPound 100: lb2.local location LBVIP_not_storage1 LBVIP -inf: storage1 location LBVIP_not_storage2 LBVIP -inf: storage2 you are missing an explicit score for LBVIP on lb1/lb2 location LBVarnish_not_storage1 LBVarnish -inf: storage1 location LBVarnish_not_storage2 LBVarnish -inf: storage2 location LBVarnish_prefer_lb1 LBVarnish 200: lb1.local location LBVarnish_prefer_lb2 LBVarnish 200: lb2.local location MCVIP_not_storage1 MCVIP -inf: storage1 location MCVIP_not_storage2 MCVIP -inf: storage2 and another missing explicit score for MCVIP on lb1/lb2 Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now colocation LBMemcached_with_MCVIP inf: LBMemcached MCVIP colocation LBPound_with_LBVIP inf: LBPound LBVIP colocation LBVarnish_with_LBVIP inf: LBVarnish LBVIP order LBMemcached_after_MCVIP inf: MCVIP LBMemcached order LBPound_after_LBVIP inf: LBVIP LBPound order LBVarnish_after_LBVIP inf: LBVIP LBVarnish property $id=cib-bootstrap-options \ dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \ cluster-infrastructure=openais \ expected-quorum-votes=4 \ stonith-enabled=false \ no-quorum-policy=ignore \ symmetric-cluster=true \ default-resource-stickiness=1 \ last-lrm-refresh=1352476209 rsc_defaults $id=rsc-options \ resource-stickiness=10 \ migration-threshold=100 The problem is that when I set crm_attribute --attr-name symmetric-cluster --attr-value true crm_mon: Last updated: Mon Nov 12 17:48:14 2012 Last change: Fri Nov 9 17:39:24 2012 via crm_attribute on storage1 Stack: openais Current DC: storage2 - partition with quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 4 Nodes configured, 4 expected votes 5 Resources configured. 
Online: [ lb1.local lb2.local storage1 storage2 ] LBVIP (ocf::heartbeat:IPaddr2): Started lb1.local LBPound (lsb:pound):Started lb1.local LBVarnish (lsb:varnish): Started lb1.local MCVIP (ocf::heartbeat:IPaddr2): Started lb2.local LBMemcached (lsb:memcached):Started lb2.loca but when I set crm_attribute --attr-name symmetric-cluster --attr-value false crm_mon: Last updated: Mon Nov 12 17:48:43 2012 Last change: Fri Nov 9 17:39:55 2012 via crm_attribute on storage1 Stack: openais Current DC: storage2 - partition with quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 4 Nodes configured, 4 expected votes 5 Resources configured. Online: [ lb1.local lb2.local storage1 storage2 ] As far as I can see I have proper priorities set for my primitives/resources. Any clues? Best regards, Rafal Radecki. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
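In an opt-in cluster (symmetric-cluster=false) every resource needs at least one positive location score; a sketch of the missing constraints, following the pattern already used for the other primitives (scores are illustrative):

location LBVIP_prefer_lb1 LBVIP 200: lb1.local
location LBVIP_prefer_lb2 LBVIP 100: lb2.local
location MCVIP_prefer_lb1 MCVIP 100: lb1.local
location MCVIP_prefer_lb2 MCVIP 200: lb2.local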
Re: [Linux-HA] STONITH for Amazon EC2 - fence_ec2
On 10/16/2012 05:16 PM, Borja García de Vinuesa wrote: Hi again! As I said, now pacemaker takes fence_ec2. However, it's not working as expected. This is what I find when starting pacemaker on all nodes (online ha1 and ha2 and ha3 in standby): -- Last updated: Tue Oct 16 17:11:47 2012 Last change: Tue Oct 16 16:57:46 2012 via cibadmin on ha1 Stack: openais Current DC: ha2 - partition with quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 3 Nodes configured, 3 expected votes 6 Resources configured. Node ha3: standby Online: [ ha1 ha2 ] Master/Slave Set: ms_drbd_r0 [p_drbd_r0] Masters: [ ha1 ] Slaves: [ ha2 ] Resource Group: g_r0 p_fs_r0(ocf::heartbeat:Filesystem):Started ha1 p_ip_r0(ocf::heartbeat:IPaddr2): Started ha1 p_r0 (ocf::heartbeat:mysql): Started ha1 Failed actions: stonith_my-ec2-nodes_start_0 (node=ha2, call=11, rc=1, status=complete): unknown error stonith_my-ec2-nodes_start_0 (node=ha1, call=7, rc=1, status=complete): unknown error - CONFIGURATION [root@ha1 ~]# crm configure show node ha1 node ha2 node ha3 \ attributes standby=on primitive p_drbd_r0 ocf:linbit:drbd \ params drbd_resource=r0 \ op monitor interval=15s primitive p_fs_r0 ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/var/lib/mysql_drbd fstype=ext4 primitive p_ip_r0 ocf:heartbeat:IPaddr2 \ params ip=54.X.X.X cidr_netmask=32 nic=eth0 primitive p_r0 ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf datadir=/var/lib/mysql_drbd/mysql_data pid=/var/run/mysqld/mysqld.pid socket=/var/lib/mysql/mysql.sock user=root group=root additional_parameters=--bind-address=54.X.X.X --user=root \ op start interval=0 timeout=120s \ op stop interval=0 timeout=120s \ op monitor interval=20s timeout=30s \ meta is-managed=true primitive stonith_my-ec2-nodes stonith:fence_ec2 \ params ec2-home=$EC2_HOME pcmk_host_check=static-list pcmk_host_list=ha1 ha2 ha3 \ I'd say you are missing that $EC2_HOME in the env ... but logs should tell you. I used full path here and it works for me. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now op monitor interval=600s timeout=300s \ op start start-delay=30s interval=0 group g_r0 p_fs_r0 p_ip_r0 p_r0 ms ms_drbd_r0 p_drbd_r0 \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true location cli-prefer-g_r0 g_r0 \ rule $id=cli-prefer-rule-g_r0 inf: #uname eq ha1 location cli-prefer-p_r0 p_r0 \ rule $id=cli-prefer-rule-p_r0 inf: #uname eq ha1 colocation c_r0_on_drbd inf: g_r0 ms_drbd_r0:Master order o_drbd_before_mysql inf: ms_drbd_r0:promote g_r0:start property $id=cib-bootstrap-options \ dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \ cluster-infrastructure=openais \ expected-quorum-votes=3 \ no-quorum-policy=ignore \ stonith-enabled=false \ stonith-action=reboot \ last-lrm-refresh=1349707463 rsc_defaults $id=rsc-options \ resource-stickiness=100 With EC2_HOME=/tmp/ec2-api-tools-1.6.4 Any ideas on which could be the problem here? Thanks in advance! Borja Gª de Vinuesa O. Communications Administrator Skype: borja.garcia.actualize [Descripción: LogoActualizeRGB] http://www.actualize.es/ Avda. Valdelaparra, 27 P. E. Neisa Norte Edif. 1, 3ªpl. 28108 Alcobendas Madrid- España Tel. 
+34 91 799 40 70 Fax +34 91 799 40 79 www.actualize.eshttp://www.actualize.es/ ESPAÑA * BRASIL * CHILE * COLOMBIA * MEXICO * UK Aviso sobre confidencialidad Este documento se dirige exclusivamente a su destinatario y puede contener información confidencial o cuya divulgación debe estar autorizada en virtud de la legislación vigente. Se informa a quien lo recibiera sin ser el destinatario o persona autorizada por éste, que la información contenida en el mismo es reservada y su utilización o divulgación con cualquier fin está prohibida. Si ha recibido este documento por error, le rogamos que nos lo comunique inmediatamente por esta misma vía o por teléfono, y proceda a su destrucción. Confidentiality and Privacy: If you have received this email in error, please notify the sender and delete it as well as any attachments. The e-mail and any attached files are intended only for the use of the person or organisation to whom they are addressed. It is prohibited and may be unlawful to use copy or disclose these documents unless authorised to do so. We may need to monitor emails we send and
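For comparison, a sketch with an absolute ec2-home instead of relying on the environment variable, reusing the path and node names from the post (untested):

primitive stonith_my-ec2-nodes stonith:fence_ec2 \
    params ec2-home="/tmp/ec2-api-tools-1.6.4" \
        pcmk_host_check="static-list" pcmk_host_list="ha1 ha2 ha3" \
    op monitor interval=600s timeout=300s \
    op start start-delay=30s interval=0
# note: the cluster will only actually use fencing once stonith-enabled=true is set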
Re: [Linux-HA] Stickiness confusion
On 10/11/2012 12:43 AM, Kevin F. La Barre wrote: I'm testing stickiness in a sandbox that consists of 3 nodes. The configuration is very simple but it's not acting the way I think it should. My configuration: # crm configure show node hasb1 node hasb2 node hasb3 primitive postfix lsb:postfix \ op monitor interval=15s property $id=cib-bootstrap-options \ dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \ cluster-infrastructure=openais \ expected-quorum-votes=3 \ no-quorum-policy=ignore \ stonith-enabled=false \ last-lrm-refresh=1349902760 \ maintenance-mode=false \ is-managed-default=true rsc_defaults $id=rsc-options \ resource-stickiness=100 The test resource postfix lives on hasb1. # crm_simulate -sL Current cluster status: Online: [ hasb1 hasb3 hasb2 ] postfix(lsb:postfix): Started hasb1 Allocation scores: native_color: postfix allocation score on hasb1: 100 native_color: postfix allocation score on hasb2: 0 native_color: postfix allocation score on hasb3: 0 On hasb1 I'll kill the corosync process. Resource moves over to hasb2 as expected. So cluster processes are killed and the resource keeps running on hasb1 ... and starts a second time on hasb2 as hasb1 is still running and you have no stonith ... # crm status Last updated: Wed Oct 10 22:35:23 2012 Last change: Wed Oct 10 21:30:12 2012 via crm_resource on hasb2 Stack: openais Current DC: hasb2 - partition with quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 3 Nodes configured, 3 expected votes 1 Resources configured. Online: [ hasb3 hasb2 ] OFFLINE: [ hasb1 ] postfix(lsb:postfix): Started hasb2 # crm_simulate -sL Current cluster status: Online: [ hasb3 hasb2 ] OFFLINE: [ hasb1 ] postfix(lsb:postfix): Started hasb2 Allocation scores: native_color: postfix allocation score on hasb1: 0 native_color: postfix allocation score on hasb2: 100 native_color: postfix allocation score on hasb3: 0 Now I'll start corosync pacemaker. Postfix resource moves back to hasb1 even though we have default stickiness. in your logs you will see the cluster detecting postfix running twice and do a stop all/start one by default ... really stop/reset a server if you want to test node failures, like: echo b /proc/sysrq-trigger ... and use stonith! Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now # crm status Last updated: Wed Oct 10 22:37:00 2012 Last change: Wed Oct 10 21:30:12 2012 via crm_resource on hasb2 Stack: openais Current DC: hasb2 - partition with quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 3 Nodes configured, 3 expected votes 1 Resources configured. Online: [ hasb1 hasb3 hasb2 ] postfix(lsb:postfix): Started hasb1 # crm_simulate -sL Current cluster status: Online: [ hasb1 hasb3 hasb2 ] postfix(lsb:postfix): Started hasb1 Allocation scores: native_color: postfix allocation score on hasb1: 100 native_color: postfix allocation score on hasb2: 0 native_color: postfix allocation score on hasb3: 0 What am I missing? I'm pulling my hair - any help would be appreciated greatly. Corosync 1.4.1 Pacemaker 1.1.7 CentOS 6.2 -Kevin ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
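A sketch of the harder failure test suggested above, run on the node to be killed (note the shell redirect; the kernel must have sysrq enabled):

echo b > /proc/sysrq-trigger    # immediate reboot, nothing is stopped cleanly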
Re: [Linux-HA] STONITH for Amazon EC2 - fence_ec2
On 10/06/2012 08:45 AM, Kevin F. La Barre wrote: I'm trying to get the fence_ec2 agent (link below) working and a bit confused on how it should be configured. I have modified the agent with the EC2 key and cert, region, etc. The part of confused about is the port argument and how it's supposed to work. Am I supposed to hardcode the uname into the port variable or is this somehow passed into the script as an argument? If I hardcode it, I don't understand how Pacemaker passes on the information as to which node to kill. Versions and config. details follow. I apologize if this has been vague. Please let me know if you need more information. Fencing agent: https://github.com/beekhof/fence_ec2/blob/392a146b232fbf2bf2f75605b1e92baef4be4a01/fence_ec2 crm configure primitive ec2-fencing stonith::fence_ec2 \ params action=reboot \ op monitor interval=60s try something like: primitive stonith_my-ec2-nodes stonith:fence_ec2 \ params ec2-home=/root/.ec2 pcmk_host_check=static-list pcmk_host_list=myec2-01 myec2-02 \ op monitor interval=600s timeout=300s \ op start start-delay=30s interval=0 ... where the nodenames are sent as port paramter. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Corosync v1.4.1 Pacemaker v1.1.7 CentOS 6.2 -Kevin ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat not starting when both nodes are down
On 10/08/2012 09:42 PM, Nicolás wrote: El 28/09/2012 20:42, Nicolás escribió: Hi all! I'm new to this list, I've been looking to get some info about this but I haven't seen anything, so I'm trying this way. I've successfully configured a 2-node cluster with DRBD + Heartbeat + Pacemaker. It works as expected. The problem comes when both nodes are down. Having this, after powering on one of the nodes, I can see it configuring the network but after this I never see the console for this machine. So I try to connect via SSH and realize that Heartbeat is not running. After I run it manually I can see the console for this node. This only happens when BOTH nodes are down. When just one is, everything goes right as Heartbeat starts automatically on the powering-on node. I see nothing relevant in logs, my conf is as follows: root@cluster1:~# cat /etc/ha.d/ha.cf | grep -e ^[^#] logfacility local0 ucast eth1 192.168.0.91 ucast eth0 192.168.20.51 auto_failback on nodecluster1.gamez.es cluster2.gamez.es use_logd yes crm on autojoin none Any ideas on what am I doing wrong? Looks like enabled DRBD init script with default startup-timeout parameters ... that script blocks until peer is connected or timeout -- default forever (depending on some configuration parameters) or manual confirmation on console ... as heartbeat is typically last in boot process it is not (yet) started. For a new cluster use Corosync and not Heartbeat,disable DRBD init script and configure it as a Pacemaker master-slave resource. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks a lot in advance. Nicolás ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems Any ideas with this? Thanks! ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
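To make the suggestion concrete, a rough sketch of handing DRBD over to Pacemaker, assuming a DRBD resource named r0 and the crm shell; adjust the init-script command to your distribution:

# keep the init script from blocking the boot
chkconfig drbd off        # or: update-rc.d -f drbd remove on Debian-based systems
# let Pacemaker manage DRBD as a master/slave resource
crm configure primitive p_drbd_r0 ocf:linbit:drbd \
  params drbd_resource=r0 \
  op monitor interval=29s role=Master \
  op monitor interval=31s role=Slave
crm configure ms ms_drbd_r0 p_drbd_r0 \
  meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true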
Re: [Linux-HA] STONITH for Amazon EC2 - fence_ec2
On 10/08/2012 10:44 PM, Kevin F. La Barre wrote: Thank you Andreas, your input does further my understanding of how this is supposed to work; however, I'm still unclear about your statement where the nodenames are sent as port parameter. Specifically, how is the port parameter passed from Pacemaker to the fencing script? Should port be specified in the primitive declaration, and if so, since it's dynamic we cannot simply set it to a static node-name, right? Also, if you look at the code for fence_ec2 provided in the link within my original email the port variable is essentially empty, port=. If we leave this empty the script will never receive the node name, but we cannot statically populate the variable. I would think that node A or node B would be the ones fencing Node C. Node C after-all may not be able to communicate. Pacemaker also sends the cluster node name to fence to the stonith agent, that is interpreted as port. I am sure you read the description about how to use this agent and how it tries to find the correct EC2 instance name to fence, using the node name. Regards, Andreas The above comes from the fact that the node/port is not being passed in my configuration for some reason. I'm seeing errors to the effect that INSTANCE (the ec2 instance, aka port) is not being specified. Any help is much appreciated! Respectfully, Kevin On Mon, Oct 8, 2012 at 2:10 PM, Andreas Kurz andr...@hastexo.com wrote: On 10/06/2012 08:45 AM, Kevin F. La Barre wrote: I'm trying to get the fence_ec2 agent (link below) working and a bit confused on how it should be configured. I have modified the agent with the EC2 key and cert, region, etc. The part of confused about is the port argument and how it's supposed to work. Am I supposed to hardcode the uname into the port variable or is this somehow passed into the script as an argument? If I hardcode it, I don't understand how Pacemaker passes on the information as to which node to kill. Versions and config. details follow. I apologize if this has been vague. Please let me know if you need more information. Fencing agent: https://github.com/beekhof/fence_ec2/blob/392a146b232fbf2bf2f75605b1e92baef4be4a01/fence_ec2 crm configure primitive ec2-fencing stonith::fence_ec2 \ params action=reboot \ op monitor interval=60s try something like: primitive stonith_my-ec2-nodes stonith:fence_ec2 \ params ec2-home=/root/.ec2 pcmk_host_check=static-list pcmk_host_list=myec2-01 myec2-02 \ op monitor interval=600s timeout=300s \ op start start-delay=30s interval=0 ... where the nodenames are sent as port paramter. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Corosync v1.4.1 Pacemaker v1.1.7 CentOS 6.2 -Kevin ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Active/Active Cluster for DRBD partition
On 07/24/2012 10:21 AM, Yount, William D wrote: I have two servers: KNTCLFS001 and KNTCLFS002 I have a drbd partition named nfs, on each server. They are mirrored. The mirroring works perfectly. What I want is to serve this drbd partition up and have it so that if one server goes down, the drbd partition is still available on the other server. I am trying to do this in an active/active cluster. I have a cloned IP address that is supposed to be running on both machines at the same time. I can get the resources setup according to the Clusters from Scratch guide. The issue is that as soon as I take KNTCLFS001 down, all my resources go down as well. They won't even stay running on KNTCLFS002. Your configuration does not look like you followed Clusters from Scratch ... drbd has to be a master-slave resource and not a clone, the filesystem and IP needs to be ordered to start after drbd promotion and they need colocation constraints ... that is all step by step explained in this document. And you miss STONITH, a must-have for dual-primary drbd. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Someone suggested adding constraints but that doesn't really make sense. I want the services running on all systems at once. Constraints are preconditions for bringing the service up on another machine, but the service should already be running so in theory, constraints shouldn't be necessary. Any help will be greatly appreciated. I have attached my cib.xml to give anyone a better idea of what I am trying to do. I have been going through the Pacemaker 1.1 Configuration Explained but so far I haven't found anything. William ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
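For illustration, the missing ordering and colocation pieces would look roughly like this in crm syntax; ms_drbd_nfs, fs_nfs and ip_nfs are placeholder names, not taken from the attached cib.xml:

colocation col_fs_on_drbd_master inf: fs_nfs ms_drbd_nfs:Master
colocation col_ip_with_fs inf: ip_nfs fs_nfs
order ord_drbd_before_fs inf: ms_drbd_nfs:promote fs_nfs:start
order ord_fs_before_ip inf: fs_nfs ip_nfs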
Re: [Linux-HA] OpenSuSE 12.1 and Heartbeat
On 07/18/2012 12:35 PM, claudi...@mediaservice.net wrote: Hi to all, Has anyone built Heartbeat on OpenSuSE 12.1? In this new version of SuSE, Heartbeat is not maintained anymore, but I have to upgrade 2 systems using Heartbeat... Seems to be a good chance to switch to Corosync. If you (hopefully) already run Pacemaker, the switch is quite easy. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Cordially, Claudio Prono. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker/Corosync issue on Amazon VPC (ec2)
On 06/29/2012 05:22 PM, Heitor Lessa wrote: Hi, I have installed DRBD+OCFS2 and working Amazon EC2, however as a previous thread suggested we should use Pacemaker in order to get OCFS modified in runtime (modify/del nodes). Pacemaker/corosync and other components were very straight forward installing via Lucid-Cluster and Ubuntu-HA, but at the first steps I experienced some problems with CoroSync regarding network connectivity. Unfortunately, Amazon does not allow Multicast, so I used udpu once it would be the only way to get it working, but when I started I got same error on logs Even with all traffic allowed, no apparmor (ubuntu), no iptables locally at all: Jun 29 15:11:11 corosync [TOTEM ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly. Just for sake, I used iperf and netcat to send UDP packets and it is working fine in several ports, so we can rule out firewall issue. Any thoughts? yes .. security groups, adjustable in your EC2 management console. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thank you very much. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
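Besides the security groups, a udpu setup also needs every cluster node listed explicitly, since there is no multicast discovery. A minimal totem sketch for Corosync 1.4.x with placeholder VPC addresses:

totem {
  version: 2
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: 10.0.1.0
    mcastport: 5405
    member { memberaddr: 10.0.1.11 }
    member { memberaddr: 10.0.1.12 }
  }
}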
Re: [Linux-HA] WARNING: Resources violate uniqueness
On 06/29/2012 11:48 AM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote: Hi, I am in the process to setup a cluster using HP iLo3 as stonith devices using SLES11 SP2 HA Extension. (I used formerly heartbeat-1 clusters successfully). I defined the following primitives for stonith fencing. primitive st-ilo-rt-lxcl9ar stonith:ipmilan \ params hostname=rt-lxcl9ar.de.bosch.com ipaddr=10.13.172.85 port=623 auth=straight priv=admin login=stonith password=secret primitive st-ilo-rt-lxcl9br stonith:ipmilan \ params hostname=rt-lxcl9br.de.bosch.com ipaddr=10.13.172.93 port=623 auth=straight priv=admin login=stonith password=secret Using the location keyword I made sure that the stonith is not running on its own node: location l-st-rt-lxcl9a st-ilo-rt-lxcl9ar -inf: rt-lxcl9a location l-st-rt-lxcl9b st-ilo-rt-lxcl9br -inf: rt-lxcl9b crm(live)configure# verify leads to some warnings: WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for parameter port: 623 WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for parameter password: secret WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for parameter auth: straight WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for parameter priv: admin I now have two questions. 1. Why does crm the two distrinct primitives st-ilo-rt-lxcl9ar and st-ilo-rt-lxcl9br to have different parameters? Is this an error in stonith:ipmilan? yes, this is an error in the metadata of ipmilan ... but you should still be able to commit that configuration. 2. Is stonith:ipmilan the correct stonith driver for HP iLo3 (DL380 G7)? If yes, how to test if it is really working? well, I personally prefer external/ipmi ... used it on several management cards including ilos without a problem. You can use the stonith command to test it prior to cluster integration or -- once configured in pacemaker -- you can also do a kill -9 corosync on a node and hopefully see it being fenced. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.com Tel. +49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz Fehrenbach, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] WARNING: Resources violate uniqueness
On 06/29/2012 12:33 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote: Hi, thanks for the clarification! 1. Why does crm the two distrinct primitives st-ilo-rt-lxcl9ar and st-ilo-rt-lxcl9br to have different parameters? Is this an error in stonith:ipmilan? yes, this is an error in the metadata of ipmilan ... but you should still be able to commit that configuration. Yes, the configuration can be commited without problem. 2. Is stonith:ipmilan the correct stonith driver for HP iLo3 (DL380 G7)? If yes, how to test if it is really working? well, I personally prefer external/ipmi ... used it on several management cards including ilos without a problem. On the commandline I used successfully: ipmitool -H rt-lxcl9br.de.bosch.com -I lanplus -A PASSWORD -U stonith -P somepassword power status. -l lanplus is mandatory to get it working with iLo3 (Firmware 1.28) Can I make uses of this command using external/ipmi or is lanplus not supported by external/ipmi? it is supported When executing # stonith -t external/ipmi -n hostname ipaddr userid passwd interface interface is the IPMI interface ... lan or lanplus I cannot see how to configure lanplus. crm ra info stonith:external/ipmi Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.com Tel. +49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz Fehrenbach, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] stonith:external/ipmi was WARNING: Resources violate uniqueness
On 06/29/2012 03:53 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote: Hi Andreas, thank you very much. Stonith works nicely when doing the 'kill -9 corosync' tests. When looking at /var/log/messages I can see entries like Jun 29 15:12:07 rt-lxcl9a stonith-ng: [12589]: WARN: parse_host_line: Could not parse (0 0): Hmm ... well, if it works ... you can open a support request for your enterprise server I am wondering what causes this warning. primitive stonith-ilo-rt-lxcl9ar stonith:external/ipmi \ params hostname=rt-lxcl9a ipaddr=10.13.172.85 userid=stonith passwd=XXX passwd_method=param interface=lanplus primitive stonith-ilo-rt-lxcl9br stonith:external/ipmi \ params hostname=rt-lxcl9b ipaddr=10.13.172.93 userid=stonith passwd=XXX passwd_method=param interface=lanplus looks fine ... it should not be needed but you can try to add (of course for both devices): pcmk_host_check=static-list pcmk_host_list=rt-lxcl9a ... to the params list. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.com Tel. +49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz Fehrenbach, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller -Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas Kurz Sent: Freitag, 29. Juni 2012 13:06 To: linux-ha@lists.linux-ha.org Subject: Re: [Linux-HA] WARNING: Resources violate uniqueness On 06/29/2012 12:33 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote: Hi, thanks for the clarification! 1. Why does crm the two distrinct primitives st-ilo-rt-lxcl9ar and st-ilo-rt-lxcl9br to have different parameters? Is this an error in stonith:ipmilan? yes, this is an error in the metadata of ipmilan ... but you should still be able to commit that configuration. Yes, the configuration can be commited without problem. 2. Is stonith:ipmilan the correct stonith driver for HP iLo3 (DL380 G7)? If yes, how to test if it is really working? well, I personally prefer external/ipmi ... used it on several management cards including ilos without a problem. On the commandline I used successfully: ipmitool -H rt-lxcl9br.de.bosch.com -I lanplus -A PASSWORD -U stonith -P somepassword power status. -l lanplus is mandatory to get it working with iLo3 (Firmware 1.28) Can I make uses of this command using external/ipmi or is lanplus not supported by external/ipmi? it is supported When executing # stonith -t external/ipmi -n hostname ipaddr userid passwd interface interface is the IPMI interface ... lan or lanplus I cannot see how to configure lanplus. crm ra info stonith:external/ipmi Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.com Tel. 
+49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz Fehrenbach, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] drbd primary/primary for ocfs2 and undetected split brain
On 06/29/2012 04:02 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote: Hi, I am experiencing an error situation which gets not detected by the cluster. I created a 2-node cluster using drbd and want to use ocfs2 on both nodes simultaneously. (stripped off some monitor/meta stuff) bad idea ... pretty useless without the full configuration, especially the meta attributes in this case. Also share your drbd and corosycn configuration please. BTW: what is your use case to start with a simple dual-primary OCFS2 setup? primitive dlm ocf:pacemaker:controld primitive o2cb ocf:ocfs2:o2cb primitive resDRBD ocf:linbit:drbd \ params drbd_resource=r0 \ operations $id=resDRBD-operations primitive resource-fs ocf:heartbeat:Filesystem \ params device=/dev/drbd_r0 directory=/SHARED fstype=ocfs2 ms msDRBD resDRBD clone clone-dlm dlm clone clone-fs resource-fs clone clone-ocb o2cb colocation colocation-dlm-drbd inf: clone-dlm msDRBD:Master colocation colocation-fs-o2cb inf: clone-fs clone-ocb colocation colocation-ocation-dlm inf: clone-ocb clone-dlm order order-dlm-o2cb 0: clone-dlm clone-ocb order order-drbd-dlm 0: msDRBD:promote clone-dlm:start order order-o2cb-fs 0: clone-ocb clone-fs The cluster starts up happily. (everything green in crm_gui) but rt-lxcl9a:~ # drbd-overview 0:r0/0 WFConnection Primary/Unknown UpToDate/DUnknown C r- rt-lxcl9b:~ # drbd-overview 0:r0/0 StandAlone Primary/Unknown UpToDate/DUnknown r- As you can see this is a split brain situation with both nodes having the ocfs2 fs mounted but not in sync -- data loss will happen. 1. How to avoid split brain situations (I am confident that the cross link using a 10GB cable was never interrupted)? logs should reveal what happend 2. How to resolve this? Switch cluster in maintenance mode and then follow http://www.drbd.org/users-guide/s-resolve-split-brain.html ? at least you also need to stop the filesystem if it is running and you want to demote one Primary ... and then follow that link 3. How to make the cluster aware of the split brain situation? (It thinks everything is fine) setup fening method resource-and-stonith in drbd configuration, preferable use the crm-fence-peer.sh stonith script ... Pacemaker itself or better the DRBD resource agent will not react on such a situation. 4. Should the DRBD(OCFS2 setup be maintained outside the cluster instead? better not ;-) Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Mit freundlichen Grüßen / Best regards Martin Konold Robert Bosch GmbH Automotive Electronics Postfach 13 42 72703 Reutlingen GERMANY www.bosch.comhttp://www.bosch.com Tel. +49 7121 35 3322 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000; Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz Fehrenbach, Siegfried Dais; Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph Kübel, Uwe Raschke, Wolf-Henning Scheider, Werner Struth, Peter Tyroller ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
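The fencing hook mentioned in point 3 usually looks like this in the DRBD resource definition; a sketch assuming the handler scripts are in the default packaged path /usr/lib/drbd:

resource r0 {
  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  # remaining on/net sections unchanged
}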
Re: [Linux-HA] Nodes not seeing each other
On 06/27/2012 12:14 AM, Marcus Bointon wrote: On 26 Jun 2012, at 22:18, Andreas Kurz wrote: use STONITH to prevent resources running on both nodes ... you configured redundant cluster communication paths? The nodes in question are Linode VMs, so not much opportunity for that. With heartbeat you can use the cl_status command with its various options to check Heartbeats view of the cluster and heartbeats log messages from the split-brain event should also give you some hints. cl_status just confirms that each node thinks the other is dead. ok, I see two things happening in the logs: At one point proxy2 reported a slow heartbeat (20sec, deadtime was set to 15) but seemed to reconnect. Later on, both nodes reported each other as dead within the same second: Jun 25 10:14:16 proxy1 heartbeat: [2678]: WARN: node proxy2.example.com: is dead Jun 25 10:14:16 proxy1 heartbeat: [2678]: info: Link proxy2.example.com:eth0 dead. Jun 25 10:14:16 proxy1 crmd: [3205]: notice: crmd_ha_status_callback: Status update: Node proxy2.example.com now has status [dead] looks like a network problem, yes As I understand it, STONITH is intended to prevent a node rejoining in case it causes more trouble. In this case the individual nodes were fine, it appeared to be the network that was at fault. Why wouldn't these nodes automatically reconnect, given that there is no STONITH to prevent them? How should I tell them to reconnect manually? STONITH is to make sure a node is really dead before acquiring its resources ... without stonith and ignored quorum, nodes don't care. If the network is working as expected again, Heartbeat should reconnect automatically ... if not, restart Heartbeat if you are confident the network problem is solved. Regards, Andreas I can also see that it failed to send alerts from the email resources at the same time because DNS lookups were failing: all points to a wider network issue. I wonder if Linode has micro-outages on their network since we've also been seeing some problems with mmm reporting 'network unreachable' on some other instances at the same time. Marcus -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] What's the meaning of ... Failed application of an update diff
On 06/20/2012 04:35 PM, alain.mou...@bull.net wrote: Hi hb_report does not work. how to do a report tarball ? It has been renamed to crm_report Regards, Andreas Thanks Alain De :Andrew Beekhof and...@beekhof.net A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 19/06/2012 11:37 Objet : Re: [Linux-HA] What's the meaning of ... Failed application of an update diff Envoyé par :linux-ha-boun...@lists.linux-ha.org On Tue, Jun 19, 2012 at 6:29 PM, Lars Marowsky-Bree l...@suse.com wrote: On 2012-06-19T08:38:11, alain.mou...@bull.net wrote: So that means that my modifications by crm configure edit , even if they are correct (I've re-checked them) , have potentially corrupt the Pacemaker configuration ? No. The CIB automatically recovers from this by doing a full sync. The messages are harmless and only indicate an inefficiency, not a real problem. They could be indicative of a bug. I wouldn't mind seeing a report tarball (maybe file a bug for it). Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
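crm_report takes the same kind of time window hb_report did; an example invocation, with the time range and destination path as placeholders, which should leave a tarball next to the given path:

crm_report -f "2012-06-20 12:00" -t "2012-06-20 17:00" /tmp/update-diff-report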
Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)
On 06/19/2012 09:22 AM, alain.mou...@bull.net wrote: Hi Sorry but I don't know what iirc means , I suppose iirc in this context stands for plugin. If so, how can I check for sure that the plugin is or is not in the pacemaker package ? (it is to check the pacemaker package delivered with the new RH 6.3) Check the pacemaker package if it contains the /etc/init.d/pacemaker init script, that start the MCP ... if yes, then adopt your corosync.conf as I wrote in my previous mail. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks Alain De :Andrew Beekhof and...@beekhof.net A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 18/06/2012 23:38 Objet : Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2) Envoyé par :linux-ha-boun...@lists.linux-ha.org On Mon, Jun 18, 2012 at 11:04 PM, alain.mou...@bull.net wrote: Hi again, could you tell me the package which install the pacemaker plugin v1 and which is the name of the binary or binaries or src ? its the pacemaker source rpm. it doesn't ship by default iirc Thanks a lot Alain De :Andrew Beekhof and...@beekhof.net A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 16/06/2012 12:25 Objet : Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2) Envoyé par :linux-ha-boun...@lists.linux-ha.org On Fri, Jun 15, 2012 at 10:06 PM, alain.mou...@bull.net wrote: Hi Andrew you recall me in an old thread here that effectively cman was not involved in option 4 : corosync + cpg + quorumd + mcp whereas it is involved in option 3 : corosync + cpg + cman + mcp but is seems that corosync is also used in both options . cman is just a corosync plugin. think of cman being an alias for corosync + cman plugin I tried to configure option 3 as you've seen in my other email two days ago, and we only have a mini cluster.conf file , and no more corosync.conf (and it works once I start Pacemaker after cman ;-) ) My question is now : when the option 4 will be available, we will come back to the corosync.conf file ? yes as same as with option 2 and no more cluster.conf ? right And to be completely clear on why my question : the temporary option 3 forces us to use a mini cluster.conf, and therefore only one heartbeat network (or two but with bonding). I'm pretty sure you can have redundant rings with cluster.conf, I just don't know the details. But if in the future we configure option 4, and come back to corosync.conf, we will be able to have again two networks rings in the corosync.conf, and so ... that sounds be much better for me. Excpet if quorumd is working with a mini cluster.conf like cman ? No. Just corosync.conf You can get a preview of how option 4 works here: https://www.dropbox.com/s/zd1mi6u1m7ac5t9/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf I need to finish it off and push to clusterlabs... Thanks for these precisions. 
Regards Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
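On the redundant-ring question above: cluster.conf can indeed carry a second interface per node via altname, which cman maps to a second totem ring. A rough sketch with made-up node names:

<clusternodes>
  <clusternode name="node1" nodeid="1">
    <altname name="node1-ring1"/>
  </clusternode>
  <clusternode name="node2" nodeid="2">
    <altname name="node2-ring1"/>
  </clusternode>
</clusternodes>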
Re: [Linux-HA] ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop
On 06/19/2012 04:00 AM, Martin Marji Cermak wrote: Hello guys, I have 3 questions if you please. I have a HA NFS cluster - Centos 6.2, pacemaker, corosync, two NFS nodes plus 1 quorum node, in semi Active-Active configuration. By semi, I mean that both NFS nodes are active and each of them is under normal circumstances exclusively responsible for one (out of two) Volume Group - using the ocf:heartbeat:LVM RA. Each LVM volume group lives on a dedicated multipath iscsi device, exported from a shared SAN. I'm exporting a NFSv3/v4 export (/srv/nfs/software_repos directory). I need to make it available for 2 separate /21 networks as read-only, and for 3 different servers as read-write. I'm using the ocf:heartbeat:exportfs RA and it seems to me I have to use the ocf:heartbeat:exportfs RA 5 times. If you want to use NFSv4 you also need to export the virtual nfs file-system root with fsid=0. In its current incarnation, the exportfs RA does not allow to summarize different clients with the same export options in one primitive configuration, though the exportfs command would support it ... patches are welcome ;-) The configuration (only IP addresses changed) is here: http://pastebin.com/eHkgUv64 1) is there a way how to export this directory 5 times without defining 5 ocf:heartbeat:exportfs primitives? It's a lot of duplications... I search all the forums and I fear the ocf:heartbeat:exportfs simply supports only one host / network range. But maybe someone has been working on a patch? see above ... you maybe can safe a little bit duplications by using id-refs ... 2) while using the ocf:heartbeat:exportfs 5 times for the same directory, do I have to use the _same_ FSID (201 in my config) for all these 5 primitives (as Im exporting the _same_ filesystem / directory)? I'm getting this warning when doing so WARNING: Resources p_exportfs_software_repos_ae1,p_exportfs_software_repos_ae2,p_exportfs_software_repos_buller,p_exportfs_software_repos_iap-mgmt,p_exportfs_software_repos_youyangs violate uniqueness for parameter fsid: 201 Do you still want to commit? It is only a warning and as you said, it is the same filesystem. 3) wait_for_leasetime_on_stop - I believe this must be set to true when exporting NFSv4 with ocf:heartbeat:exportfs. http://www.linux-ha.org/doc/man-pages/re-ra-exportfs.html My 5 exportfs primitives reside in the same group: group g_nas02 p_lvm02 p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt p_ip02 \ meta resource-stickiness=101 Even though I have the /proc/fs/nfsd/nfsv4gracetime set to 10 seconds, a failover of the NFS group from one NFS node to the second node would take more than 50 seconds, as it will be waiting for each ocf:heartbeat:exportfs resource sleeping 10 seconds 5 times. Is there any way of making them fail over / sleeping in parallel, instead of sequential? use resource sets like: order o_nas02 inf: p_lvm02 ( p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt ) p_ip02 colocation co_nas02 inf: p_lvm02 p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt p_ip02 this allows a parallel start and stop of all exports Regards, Andreas -- Need help with Pacemaker? 
http://www.hastexo.com/now I workarounded this by setting wait_for_leasetime_on_stop=true for only one of these (which I believe is safe and does the job it is expected to do - please correct me if I'm wrong). Thank you for your valuable comments. My Pacemaker configuration: http://pastebin.com/eHkgUv64 [root@irvine ~]# facter | egrep 'lsbdistid|lsbdistrelease' lsbdistid = CentOS lsbdistrelease = 6.2 [root@irvine ~]# rpm -qa | egrep 'pacemaker|corosync|agents' corosync-1.4.1-4.el6_2.2.x86_64 pacemaker-cli-1.1.6-3.el6.x86_64 pacemaker-libs-1.1.6-3.el6.x86_64 corosynclib-1.4.1-4.el6_2.2.x86_64 pacemaker-cluster-libs-1.1.6-3.el6.x86_64 pacemaker-1.1.6-3.el6.x86_64 fence-agents-3.1.5-10.el6_2.2.x86_64 resource-agents-3.9.2-7.el6.x86_64 with /usr/lib/ocf/resource.d/heartbeat/exportfs updated by hand from: https://github.com/ClusterLabs/resource-agents/commits/master/heartbeat/exportfs Thank you very much Marji Cermak ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org
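As a concrete illustration of the NFSv4 root export mentioned above, a sketch in crm syntax; the directory, client network and resource name are placeholders, fsid=0 is the usual convention for the virtual root:

primitive p_exportfs_root ocf:heartbeat:exportfs \
  params directory="/srv/nfs" clientspec="10.0.0.0/21" \
         options="ro,crossmnt" fsid=0 \
  op monitor interval=30s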
Re: [Linux-HA] ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop
On 06/19/2012 10:48 AM, Andreas Kurz wrote: On 06/19/2012 04:00 AM, Martin Marji Cermak wrote: Hello guys, I have 3 questions if you please. I have a HA NFS cluster - Centos 6.2, pacemaker, corosync, two NFS nodes plus 1 quorum node, in semi Active-Active configuration. By semi, I mean that both NFS nodes are active and each of them is under normal circumstances exclusively responsible for one (out of two) Volume Group - using the ocf:heartbeat:LVM RA. Each LVM volume group lives on a dedicated multipath iscsi device, exported from a shared SAN. I'm exporting a NFSv3/v4 export (/srv/nfs/software_repos directory). I need to make it available for 2 separate /21 networks as read-only, and for 3 different servers as read-write. I'm using the ocf:heartbeat:exportfs RA and it seems to me I have to use the ocf:heartbeat:exportfs RA 5 times. If you want to use NFSv4 you also need to export the virtual nfs file-system root with fsid=0. In its current incarnation, the exportfs RA does not allow to summarize different clients with the same export options in one primitive configuration, though the exportfs command would support it ... patches are welcome ;-) The configuration (only IP addresses changed) is here: http://pastebin.com/eHkgUv64 1) is there a way how to export this directory 5 times without defining 5 ocf:heartbeat:exportfs primitives? It's a lot of duplications... I search all the forums and I fear the ocf:heartbeat:exportfs simply supports only one host / network range. But maybe someone has been working on a patch? see above ... you maybe can safe a little bit duplications by using id-refs ... 2) while using the ocf:heartbeat:exportfs 5 times for the same directory, do I have to use the _same_ FSID (201 in my config) for all these 5 primitives (as Im exporting the _same_ filesystem / directory)? I'm getting this warning when doing so WARNING: Resources p_exportfs_software_repos_ae1,p_exportfs_software_repos_ae2,p_exportfs_software_repos_buller,p_exportfs_software_repos_iap-mgmt,p_exportfs_software_repos_youyangs violate uniqueness for parameter fsid: 201 Do you still want to commit? It is only a warning and as you said, it is the same filesystem. 3) wait_for_leasetime_on_stop - I believe this must be set to true when exporting NFSv4 with ocf:heartbeat:exportfs. http://www.linux-ha.org/doc/man-pages/re-ra-exportfs.html My 5 exportfs primitives reside in the same group: group g_nas02 p_lvm02 p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt p_ip02 \ meta resource-stickiness=101 Even though I have the /proc/fs/nfsd/nfsv4gracetime set to 10 seconds, a failover of the NFS group from one NFS node to the second node would take more than 50 seconds, as it will be waiting for each ocf:heartbeat:exportfs resource sleeping 10 seconds 5 times. Is there any way of making them fail over / sleeping in parallel, instead of sequential? use resource sets like: small correction of myself ;-) ... 
filesystem-mount has to be before the exports of course: order o_nas02 inf: p_lvm02 ( p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt ) p_ip02 order o_nas02 inf: p_lvm02 p_fs_software_repos \ (p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller \ p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 \ p_exportfs_software_repos_iap-mgmt ) p_ip02 colocation co_nas02 inf: p_lvm02 p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller p_fs_software_repos p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 p_exportfs_software_repos_iap-mgmt p_ip02 colocation co_nas02 inf: p_lvm02 p_fs_software_repos \ p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller \ p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 \ p_exportfs_software_repos_iap-mgmt p_ip02 Regards, Andreas this allows a parallel start and stop of all exports Regards, Andreas signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)
On 06/18/2012 03:04 PM, alain.mou...@bull.net wrote: Hi again, could you tell me the package which install the pacemaker plugin v1 and which is the name of the binary or binaries or src ? You only need to add: service { # Load the Pacemaker Cluster Resource Manager ver: 1 name: pacemaker } to your corosync.conf ... or create a file with this content in /etc/corosync/service.d/. Once you started Corosync you need to start the pacemaker init script, that starts the MCP ... and stop that services in reverse order. The init script is part of the pacemaker package on RHEL 6.x Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks a lot Alain De :Andrew Beekhof and...@beekhof.net A : General Linux-HA mailing list linux-ha@lists.linux-ha.org Date : 16/06/2012 12:25 Objet : Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2) Envoyé par :linux-ha-boun...@lists.linux-ha.org On Fri, Jun 15, 2012 at 10:06 PM, alain.mou...@bull.net wrote: Hi Andrew you recall me in an old thread here that effectively cman was not involved in option 4 : corosync + cpg + quorumd + mcp whereas it is involved in option 3 : corosync + cpg + cman + mcp but is seems that corosync is also used in both options . cman is just a corosync plugin. think of cman being an alias for corosync + cman plugin I tried to configure option 3 as you've seen in my other email two days ago, and we only have a mini cluster.conf file , and no more corosync.conf (and it works once I start Pacemaker after cman ;-) ) My question is now : when the option 4 will be available, we will come back to the corosync.conf file ? yes as same as with option 2 and no more cluster.conf ? right And to be completely clear on why my question : the temporary option 3 forces us to use a mini cluster.conf, and therefore only one heartbeat network (or two but with bonding). I'm pretty sure you can have redundant rings with cluster.conf, I just don't know the details. But if in the future we configure option 4, and come back to corosync.conf, we will be able to have again two networks rings in the corosync.conf, and so ... that sounds be much better for me. Excpet if quorumd is working with a mini cluster.conf like cman ? No. Just corosync.conf You can get a preview of how option 4 works here: https://www.dropbox.com/s/zd1mi6u1m7ac5t9/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf I need to finish it off and push to clusterlabs... Thanks for these precisions. Regards Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Upgrade pacemaker 1.0.9.1 to squeeze-backports?
Hi, On 06/12/2012 04:42 PM, Helmut Wollmersdorfer wrote: Hi, are there any known problems with an upgrade from stock debian squeeze to squeeze backports, i.e. No, it just works ;-) from ii pacemaker 1.0.9.1+hg15626-1 HA cluster resource manager ii heartbeat 1:3.0.3-2Subsystem for High- Availability Linux to pacemaker (1.1.7-1~bpo60+1) heartbeat (1:3.0.5-2~bpo60+1) ... and cluster-glue, resource-agents You really want to start thinking about migrating to Corosync, though Heartbeat still works (up to now). As I have to do a hardware upgrade of both nodes of a 2-node Xen-DRBD- cluster, should a upgrade pacemaker first with just apt.get ... or what is the recommended procedure (AFAIK pacemaker 1.0.9 does not support 'maintenance-mode'). maintenance-mode should work fine in Pacemaker 1.0.9, do you have problems enabling it? ... crm configure property maintenance-mode=true Best Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now TIA Helmut Wollmersdorfer ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
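A possible upgrade sequence, sketched under the assumption that maintenance-mode is set first so the Xen/DRBD resources keep running unmanaged while packages are swapped one node at a time:

crm configure property maintenance-mode=true
apt-get -t squeeze-backports install pacemaker heartbeat cluster-glue resource-agents
/etc/init.d/heartbeat restart      # repeat on the second node
crm configure property maintenance-mode=false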
Re: [Linux-HA] VirtualDomain stop error
Hi Markus, On 05/30/2012 01:45 PM, Markus Knaup wrote: Pawel Warowny warp at master.pl writes: Dnia Fri, 1 Jul 2011 12:57:32 +0200 Pawel Warowny warp at master.pl napisał(a): Hi Errors are always the same: cannot send monitor command '{execute:query-balloon}': Connection reset by peer Because no one answered, should I report a bug about VirtualDomain resource agent and what's the proper way to do it? Best regards Hi Pawel, I have the same problem with my setup (DRBD with a virtual machine with KVM running, Pacemaker and Corosync controlling the cluster). When I stop the vm, sometimes an error occurs and the vm will not be migrated to the other node. I have to clean the resource by hand. Did you find a solution? Best regars Can you give us some more information? what are the errors that occur? ... grep your logs for VirtualDomain ... do you use the latest version of the RA? And what is the output of crm_mon -1fr after this error happened? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Markus ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Error while creating various disk volumes
On 05/22/2012 02:12 PM, Net Warrior wrote: Hi there Reading the documentation I found that I can have various disks in the configuration, and I need to do so, this is my conf resource myresource { syncer { rate 100M; } volume 0 { device /dev/drbd1; disk /dev/rootvg/lvu02; meta-disk /dev/rootvg/drbdmetadata[0]; } volume 1 { device /dev/drbd2; disk /dev/rootvg/lvarch; meta-disk /dev/rootvg/drbdmetadata[0]; } on node1 { address x.x.x.x:7789; } on node2 { address x.x.x.x:7789; } } When creating the resource I get the following error drbd.d/myresource.res:7: Parse error: 'protocol | on | disk | net | syncer | startup | handlers | ignore-on | stacked-on-top-of' expected, but got 'volume' (TK 281) this is a DRBD 8.4 feature I'm using this version drbd83-8.3.8-1.el4_8 Any help on this? Please read the DRBD users guide for version 8.3 and _not_ 8.4 ... http://www.drbd.org/users-guide-8.3/ Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thank you very much for your time and support Regards ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
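Since drbd83 has no volume keyword, the same layout has to be split into one resource per DRBD device, each on its own TCP port; a sketch reusing the names from the posted config (addresses stay placeholders):

resource r_u02 {
  syncer { rate 100M; }
  device    /dev/drbd1;
  disk      /dev/rootvg/lvu02;
  meta-disk /dev/rootvg/drbdmetadata[0];
  on node1 { address x.x.x.x:7789; }
  on node2 { address x.x.x.x:7789; }
}
resource r_arch {
  syncer { rate 100M; }
  device    /dev/drbd2;
  disk      /dev/rootvg/lvarch;
  meta-disk /dev/rootvg/drbdmetadata[1];
  on node1 { address x.x.x.x:7790; }
  on node2 { address x.x.x.x:7790; }
}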
[Linux-ha-dev] Rename parameter of conntrackd RA
Hi all, Dejan asked me to bring this up on mailing-list. Please have a look at this proposed patch for conntrackd. It renames the parameter conntrackd to binary and does a remap of conntrackd to binary if users of the old RA version use it. Background: Internally the RA expects a parameter called binary to be defined. So whatever value users specified in conntrackd parameter, it was ignored. A default value was used instead. Please share your thoughts on changing a parameter name of an already released resource agent. thx regards, Andreas diff --git a/heartbeat/conntrackd b/heartbeat/conntrackd index 7502f5a..3ee2f83 100755 --- a/heartbeat/conntrackd +++ b/heartbeat/conntrackd @@ -36,7 +36,10 @@ OCF_RESKEY_binary_default=conntrackd OCF_RESKEY_config_default=/etc/conntrackd/conntrackd.conf -: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}} + +# For users of versions prior to 1.2: +# Map renamed parameter conntrackd to binary if in use +: ${OCF_RESKEY_binary=${OCF_RESKEY_conntrackd-${OCF_RESKEY_binary_default}}} : ${OCF_RESKEY_config=${OCF_RESKEY_config_default}} meta_data() { @@ -44,7 +47,7 @@ meta_data() { ?xml version=1.0? !DOCTYPE resource-agent SYSTEM ra-api-1.dtd resource-agent name=conntrackd -version1.1/version +version1.2/version longdesc lang=en Master/Slave OCF Resource Agent for conntrackd @@ -53,7 +56,7 @@ Master/Slave OCF Resource Agent for conntrackd shortdesc lang=enThis resource agent manages conntrackd/shortdesc parameters -parameter name=conntrackd +parameter name=binary longdesc lang=enName of the conntrackd executable. If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary For example my-conntrackd-binary-version-0.9.14 -- 1.7.4.1 signature.asc Description: OpenPGP digital signature ___ Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev Home Page: http://linux-ha.org/
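For context, this is roughly how the renamed parameter is consumed from the crm shell, assuming the usual master/slave arrangement for conntrackd; resource names, intervals and meta attributes are examples only:

primitive p_conntrackd ocf:heartbeat:conntrackd \
  params binary="/usr/sbin/conntrackd" config="/etc/conntrackd/conntrackd.conf" \
  op monitor interval=20s role="Slave" \
  op monitor interval=30s role="Master"
ms ms_conntrackd p_conntrackd \
  meta notify=true interleave=true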
Re: [Linux-HA] Heartbeat Failover Configuration Question
On 04/23/2012 01:47 PM, Net Warrior wrote: True, but even on the most expensive software likve Veritas Cluster or Red Hat Cluster I can configure how I want to failover the resources ( auto or manual ), that's why my curiosity to acomplish the same in here. with the help of the meat-ware stonith plugin a manual acknowledge of the failover process is required. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks for your time Best Regards 2012/4/23, David Coulson da...@davidcoulson.net: Why even use heartbeat then - Just manually ifconfig the interface. On 4/23/12 7:39 AM, Net Warrior wrote: Hi Nikita This is the version heartbeat-3.0.0-0.7 My aim is to, if node1 is powered off or losts it's ethernet connection,. node2 wont make the failover automatically, I want to make it manually, but could not find how to accomplish that. Thanks for your time and support Best regards 2012/4/23, Nikita Michalkomichalko.sys...@a-i-p.com: Hi, Net Warrior! What version of HA/Pacemaker do you use? Did you already RTFM - e.g. http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained - or: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch HTH Nikita Michalko Am Montag, 23. April 2012 02:23:20 schrieb Net Warrior: Hi There I configured heartbeat to failover an IP address , if I for example shutdown one node, the other takes it's ip address, so far so good, now my doubt is if there is a way to configure it not to make the failover automatically and have someone run the failover manually, can you provide any configuration example please? is this stanza the one that does the magic? auto_failback on Thanks for your time and support Best regards ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
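A minimal sketch of the meatware approach mentioned above; node names are placeholders, and the operator has to acknowledge the fencing by hand before resources move:

crm configure primitive st-meat stonith:meatware \
  params hostlist="node1 node2" \
  op monitor interval=3600s
# after a node failure, confirm on the surviving node that node1 is really down:
meatclient -c node1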
Re: [Linux-HA] problem with pingd
On 04/12/2012 02:59 PM, Trujillo Carmona, Antonio wrote: I'm try to configure a cluster and I have problem with pingd. my config is crm(live)configure# show node proxy-00 node proxy-01 primitive ip-segura ocf:heartbeat:IPaddr2 \ params ip=10.104.16.123 nic=lan cidr_netmask=19 \ op monitor interval=10 \ meta target-role=Started primitive pingd ocf:pacemaker:pingd \ use ocf:pacemaker:ping params host_list=10.104.16.157 \ and you have to define a monitor operation. Without any constraints to let the cluster react on connectivity changes ping resource is useless ... this may help: http://www.hastexo.com/resources/hints-and-kinks/network-connectivity-check-pacemaker Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now meta target-role=Started property $id=cib-bootstrap-options \ dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \ cluster-infrastructure=openais \ stonith-enabled=false \ no-quorum-policy=ignore \ expected-quorum-votes=2 crm(live)# status Last updated: Thu Apr 12 14:54:21 2012 Last change: Thu Apr 12 14:40:00 2012 Stack: openais Current DC: proxy-00 - partition WITHOUT quorum Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 2 Nodes configured, 2 expected votes 2 Resources configured. Online: [ proxy-00 ] OFFLINE: [ proxy-01 ] ip-segura(ocf::heartbeat:IPaddr2): Started proxy-00 Failed actions: pingd:0_monitor_0 (node=proxy-00, call=5, rc=2, status=complete): invalid parameter pingd_monitor_0 (node=proxy-00, call=8, rc=2, status=complete): invalid parameter crm(live)resource# start pingd crm(live)resource# status ip-segura(ocf::heartbeat:IPaddr2) Started pingd(ocf::pacemaker:pingd) Stopped and in the system log I got: Apr 12 14:55:18 proxy-00 crm_resource: [27941]: ERROR: unpack_rsc_op: Hard error - pingd:0_last_failure_0 failed with rc=2: Preventing pingd:0 from re-starting on proxy-00 Apr 12 14:55:18 proxy-00 crm_resource: [27941]: ERROR: unpack_rsc_op: Hard error - pingd_last_failure_0 failed with rc=2: Preventing pingd from re-starting on proxy-00 I have stoped node 2 in order to less problem ¿I can't found any reference to this error? ¿Can you help me? please. signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
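Spelling out the hint above: the ping resource needs a monitor operation (that is what keeps the pingd attribute updated), it is normally cloned to run on all nodes, and a constraint has to act on the attribute. A sketch reusing the address from the posted config:

primitive p_ping ocf:pacemaker:ping \
  params host_list="10.104.16.157" multiplier=1000 dampen=5s \
  op monitor interval=15s timeout=60s
clone cl_ping p_ping
location l_ip_on_connected ip-segura \
  rule -inf: not_defined pingd or pingd lte 0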
Re: [Linux-HA] crm configure primitive syntax, please HELP!!!
On 03/30/2012 01:14 PM, Guglielmo Abbruzzese wrote: Hi everybody, I've got a question probably very easy for someone I cant' find a proper answer in the official doc. In details, if I refer to the official pacemaker documentation related to the command crm configure primitive I find what follows: usage: primitive rsc [class:[provider:]]type [params param=value [param=value...]] [meta attribute=value [attribute=value...]] [utilization attribute=value [attribute=value...]] [operations id_spec [op op_type [attribute=value...] ...]] If I launch the following command by CLI: crm configure primitive resource_vrt_ip ocf:heartbeat:IPaddr2 params ip=192.168.15.73 nic=bond0 meta target-role=Stopped multiple-active=stop_start migration-treshold=3 failure-timeout=0 operations X op name=monitor interval=180 timeout=60 I get the following answer: ERROR: operations: only single $id or $id-ref attribute is allowed Now, I can't get the meaning of the parameter id_spec mentioned in the usage output. Could someone help me, or tell me what shell I write instead of in order to get the following output in the cib?? you can omit that operations X part Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now primitive class=ocf id=resource_vrt_ip provider=heartbeat type=IPaddr2 instance_attributes id=resource_vrt_ip-instance_attributes nvpair id=resource_vrt_ip-instance_attributes-ip name=ip value=192.168.15.73/ nvpair id=resource_vrt_ip-instance_attributes-nic name=nic value=bond0/ /instance_attributes meta_attributes id=resource_vrt_ip-meta_attributes nvpair id=resource_vrt_ip-meta_attributes-target-role name=target-role value=Stopped/ nvpair id=resource_vrt_ip-meta_attributes-multiple-active name=multiple-active value=stop_start/ nvpair id=resource_vrt_ip-meta_attributes-migration-threshold name=migration-threshold value=3/ nvpair id=resource_vrt_ip-meta_attributes-failure-timeout name=failure-timeout value=0/ /meta_attributes operations op id=resource_vrt_ip-startup interval=180 name=monitor timeout=60s/ /operations /primitive P.S. I'd prefer not to load a XML file, I already tried it and it works but it is not the purpose of my help request Thanks in advance Guglielmo Abbruzzese Project Leader RESI Informatica S.p.A. Via Pontina Km 44,044 04011 Aprilia (LT) - Italy Tel: +39 0692710 369 Fax: +39 0692710 208 Email: g.abbruzz...@resi.it Web: www.resi.it ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
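In other words, the operations container can simply be dropped; the same primitive without it (note migration-threshold spelled in full):

crm configure primitive resource_vrt_ip ocf:heartbeat:IPaddr2 \
  params ip=192.168.15.73 nic=bond0 \
  meta target-role=Stopped multiple-active=stop_start \
       migration-threshold=3 failure-timeout=0 \
  op monitor interval=180 timeout=60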
Re: [Linux-HA] crm don't conect to cluster
On 03/30/2012 08:27 AM, Trujillo Carmona, Antonio wrote: On Wed, 28 Mar 2012 at 14:09 +0200, Andreas Kurz wrote: [attached e-mail message - forwarded message, Subject: Date: Wed, 28 Mar 2012 14:46:15 +0200] For a new cluster you should go with Corosync ... choose either Heartbeat _or_ Corosync. Don't start both at the same time. Can you share the corosync.conf that produced the errors you showed in previous mails? OK, that's my idea, but since I can't make it run I began to test other things, even stupid ones. This is my corosync.conf. Looks ok, if all nodes have the same file and the same (and correct) auth file all should run fine. Disable secauth and see if it still does not work. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now # cat /etc/corosync/corosync.conf # Please read the openais.conf.5 manual page totem { version: 2 # How long before declaring a token lost (ms) token: 3000 # How many token retransmits before forming a new configuration token_retransmits_before_loss_const: 10 # How long to wait for join messages in the membership protocol (ms) join: 60 # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms) consensus: 3600 # Turn off the virtual synchrony filter vsftype: none # Number of messages that may be sent by one processor on receipt of the token max_messages: 20 # Limit generated nodeids to 31-bits (positive signed integers) clear_node_high_bit: yes # Disable encryption secauth: on # How many threads to use for encryption/decryption threads: 0 # Optionally assign a fixed node id (integer) # nodeid: 1234 # This specifies the mode of redundant ring, which may be none, active, or passive. rrp_mode: none interface { # The following values need to be set based on your environment ringnumber: 0 bindnetaddr: 10.104.0.0 mcastaddr: 226.94.1.1 mcastport: 5405 } } amf { mode: disabled } service { # Load the Pacemaker Cluster Resource Manager ver: 0 name: pacemaker use_mgmtd: yes } aisexec { user: root group: root } corosync { user: root group: root } logging { fileline: off to_stderr: yes to_logfile: no to_syslog: yes syslog_facility: daemon debug: off timestamp: on logger_subsys { subsys: AMF debug: off tags: enter|leave|trace1|trace2|trace3|trace4|trace6 } } Thanks.
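To act on Andreas' suggestion, the two quickest checks are to switch secauth off on all nodes or to regenerate one authkey and distribute it; a rough sketch (the peer's hostname is taken from the thread, paths are the corosync defaults):

# on every node, in /etc/corosync/corosync.conf:
#     secauth: off
# or, keeping secauth on, create a single key and copy it to the peer:
corosync-keygen
scp /etc/corosync/authkey root@proxy-01:/etc/corosync/authkey
# then restart corosync on both nodes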
Re: [Linux-HA] crm don't conect to cluster
On 03/28/2012 12:42 PM, Trujillo Carmona, Antonio wrote: Trying another way, I installed heartbeat and right now it works: # crm crm(live)# status Last updated: Wed Mar 28 12:33:47 2012 Stack: Heartbeat Current DC: proxy-01 (d0034c01-f613-4d77-a390-05122f3a374c) - partition with quorum Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b 1 Nodes configured, unknown expected votes 0 Resources configured. Online: [ proxy-01 ] crm(live)# config ERROR: syntax: config crm(live)# configure INFO: building help index crm(live)configure# show node $id=d0034c01-f613-4d77-a390-05122f3a374c proxy-01 property $id=cib-bootstrap-options \ dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \ cluster-infrastructure=Heartbeat But I see that the cluster infrastructure is now Heartbeat and not openais, so I figure I have a misconfiguration of openais. Really I don't know which one is better for me, openais or heartbeat. I only want to monitor a service (squid) and, if it fails, move a virtual IP from the production proxy to the standby one. For a new cluster you should go with Corosync ... choose either Heartbeat _or_ Corosync. Don't start both at the same time. Can you share the corosync.conf that produced the errors you showed in previous mails? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now
Re: [Linux-HA] crm don't conect to cluster
On 03/27/2012 02:58 PM, Trujillo Carmona, Antonio wrote: I'm just try to configure a new cluster with pacemaker (based in debian stable). I follow same instruction I take in other cluster but I can't make it work. ¿Can you give me same path to check what is the problem? You get this totem messages on a node that tries to join? You checked all nodes have the same corosync configuration ... especially the same secauth options and (if in use) the same authkey? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now When I check crm with crm status I always get: root@proxy-00:/etc/network# crm crm(live)# status Connection to cluster failed: connection failed In the log I got: Mar 27 14:21:19 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 86. Mar 27 14:21:19 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 113. Mar 27 14:21:19 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 18. Mar 27 14:21:19 proxy-00 crmd: [1501]: info: ais_dispatch: Membership 16: quorum still lost Mar 27 14:21:20 proxy-00 cib: [1540]: info: write_cib_contents: Wrote version 0.0.0 of the CIB to disk (digest: ce550593fab3e1d7832aa06b6df0621d) Mar 27 14:21:20 proxy-00 cib: [1540]: info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.yAAkTS (digest: /var/lib/heartbeat/crm/cib.IyTINP) Mar 27 14:21:20 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 7. Mar 27 14:21:21 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 91. Mar 27 14:21:21 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 21. Mar 27 14:21:21 proxy-00 frox[1362]: Listening on 0.0.0.0:8021 Mar 27 14:21:21 proxy-00 frox[1362]: Dropped privileges Mar 27 14:21:21 proxy-00 corosync[1465]: [TOTEM ] Type of received message is wrong... ignoring 110. Mar 27 14:21:22 proxy-00 attrd: [1499]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such file or directory (2) Mar 27 14:21:22 proxy-00 attrd: [1499]: ERROR: ais_dispatch: AIS connection failed Mar 27 14:21:22 proxy-00 cib: [1497]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11) Mar 27 14:21:22 proxy-00 attrd: [1499]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service! Mar 27 14:21:22 proxy-00 cib: [1497]: ERROR: ais_dispatch: AIS connection failed Mar 27 14:21:22 proxy-00 attrd: [1499]: info: main: Exiting... Mar 27 14:21:22 proxy-00 cib: [1497]: ERROR: cib_ais_destroy: AIS connection terminated Mar 27 14:21:22 proxy-00 crmd: [1501]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11) Mar 27 14:21:22 proxy-00 crmd: [1501]: ERROR: ais_dispatch: AIS connection failed Mar 27 14:21:22 proxy-00 crmd: [1501]: ERROR: crm_ais_destroy: AIS connection terminated Mar 27 14:21:22 proxy-00 stonithd: [1496]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: No such file or directory (2) Mar 27 14:21:22 proxy-00 stonithd: [1496]: ERROR: ais_dispatch: AIS connection failed Mar 27 14:21:22 proxy-00 stonithd: [1496]: ERROR: AIS connection terminated Mar 27 14:21:25 proxy-00 kernel: [ 18.780256] wan: no IPv6 routers present Mar 27 14:21:26 proxy-00 kernel: [ 19.860256] lan: no IPv6 routers present Thank. 
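The "Type of received message is wrong" TOTEM errors together with the dying AIS connections fit the question Andreas asks: do all nodes really run the same corosync configuration, the same secauth setting and the same authkey? A quick way to compare the relevant pieces on each node (a sketch using the default paths):

md5sum /etc/corosync/corosync.conf /etc/corosync/authkey    # run on every node; the checksums must match
grep -E 'secauth|bindnetaddr|mcast' /etc/corosync/corosync.conf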
Re: [Linux-HA] Failed RAID leads to failed cluster
On 03/20/2012 10:41 AM, Christoph Bartoschek wrote: Hi, we have a two-node NFS server setup. Each node has a RAID 6 with an Adaptec hardware controller. DRBD synchronizes the block device. On top of it there is an NFS server. Today the RAID controller on the master failed to rebuild after one hard disk had crashed, and the device /dev/sdb1 became unavailable temporarily. I assume this is the case because of the following messages: Mar 20 04:01:58 laplace kernel: [1786373.892141] sd 0:0:1:0: [sdb] Very big device. Trying to use READ CAPACITY(16). Mar 20 04:05:47 laplace kernel: [1786602.053040] block drbd1: peer( Secondary - Unknown ) conn( Connected - TearDown ) pdsk( UpToDate - Outdated ) The cluster then detected the failure and tried to promote the slave and demote the master. This failed because LVM timed out while being stopped on the master. I assume it tried to write something to the DRBD device and failed, resulting in the timeout. So my question is: what are we doing wrong? And how can we prevent the failure of the whole cluster in such a situation? Please share your drbd and cluster configuration ... two lines from the log are not really enough to make suggestions based on facts. Best Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks Christoph
Re: [Linux-HA] Failed RAID leads to failed cluster
On 03/20/2012 03:04 PM, Christoph Bartoschek wrote: Am 20.03.2012 14:42, schrieb Andreas Kurz: Please share your drbd and cluster configuration ... two lines from log are not really enough to make suggestions based on facts. I am sure that the raid controller either was blocking or unavailable for some time: Mar 20 04:04:21 laplace kernel: [1786516.040017] aacraid: Host adapter abort request (0,0,1,0) Mar 20 04:04:21 laplace kernel: [1786516.047925] aacraid: Host adapter abort request (0,0,1,0) Mar 20 04:04:21 laplace kernel: [1786516.055909] aacraid: Host adapter abort request (0,0,1,0) Mar 20 04:04:21 laplace kernel: [1786516.063740] aacraid: Host adapter abort request (0,1,2,0) Mar 20 04:04:21 laplace kernel: [1786516.071576] aacraid: Host adapter reset request. SCSI hang ? Too bad, no I/O error so DRBD does a detach of the device ... Before this was recognized the a monitor event failed: Mar 20 04:04:05 laplace lrmd: [25177]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER Mar 20 04:04:10 laplace lrmd: [1941]: WARN: p_lvm_nfs:monitor process (PID 25087) timed out (try 1). Killing with signal SIGTERM (15). Mar 20 04:04:10 laplace lrmd: [1941]: WARN: Managed p_lvm_nfs:monitor process 25087 killed by signal 15 [SIGTERM - Termination (ANSI)]. Mar 20 04:04:10 laplace lrmd: [1941]: WARN: operation monitor[25] on ocf::LVM::p_lvm_nfs for client 1944, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1] volgrpname=[afs] CRM_meta_timeout=[2] CRM_meta_interval=[3] : pid [25087] timed out Then stopping the LVM resource failed and the cluster broke apart. Use stonith and the node would have been fenced. Best Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now The drbd.conf is: global { usage-count yes; } common { syncer { rate 125M; } } resource afs { protocol C; startup { wfc-timeout 0; degr-wfc-timeout 120; } disk { on-io-error detach; fencing resource-only; } handlers { fence-peer /usr/lib/drbd/crm-fence-peer.sh; after-resync-target /usr/lib/drbd/crm-unfence-peer.sh; } net { } on ries { device /dev/drbd1; disk /dev/sdb1; address10.1.0.2:7788; meta-disk internal; } on laplace { device /dev/drbd1; disk /dev/sdb1; address10.1.0.3:7788; meta-disk internal; } } The crm configuration is: node laplace \ attributes standby=on node ries \ attributes standby=off primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=192.168.143.228 cidr_netmask=24 \ op monitor interval=30s primitive mail ocf:pacemaker:ClusterMon \ op monitor interval=180 timeout=20 \ params extra_options=--mail-to admin htmlfile=/tmp/crm_mon.html \ meta target-role=Started primitive p_drbd_nfs ocf:linbit:drbd \ params drbd_resource=afs \ op monitor interval=15 role=Master \ op monitor interval=30 role=Slave primitive p_exportfs_afs ocf:heartbeat:exportfs \ params fsid=1 directory=/srv/nfs/afs options=rw,no_root_squash clientspec=192.168.143.0/255.255.255.0 wait_for_leasetime_on_stop=false \ op monitor interval=30s primitive p_fs_afs ocf:heartbeat:Filesystem \ params device=/dev/afs/afs directory=/srv/nfs/afs fstype=ext4 \ op monitor interval=10s primitive p_lsb_nfsserver lsb:nfs-kernel-server \ op monitor interval=30s primitive p_lvm_nfs ocf:heartbeat:LVM \ params volgrpname=afs \ op monitor interval=30s group g_nfs p_lvm_nfs p_fs_afs p_exportfs_afs ClusterIP \ meta target-role=Started ms ms_drbd_nfs p_drbd_nfs \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started clone cl_lsb_nfsserver p_lsb_nfsserver clone cl_mail mail location 
drbd-fence-by-handler-ms_drbd_nfs ms_drbd_nfs \ rule $id=drbd-fence-by-handler-rule-ms_drbd_nfs $role=Master -inf: #uname ne ries colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start property $id=cib-bootstrap-options \ dc-version=1.0.9-unknown \ cluster-infrastructure=openais \ expected-quorum-votes=3 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1332235117 rsc_defaults $id=rsc-options \ resource-stickiness=200 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http
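Andreas' "use stonith" advice maps to a small amount of configuration: enable fencing cluster-wide and give each node a fencing device that is pinned away from the node it fences. The agent and its parameters below (external/ipmi pointed at a made-up BMC address) are assumptions for illustration; use whatever out-of-band interface the nodes actually have.

primitive p_stonith_laplace stonith:external/ipmi \
    params hostname=laplace ipaddr=10.1.0.103 userid=admin passwd=secret interface=lan \
    op monitor interval=3600s
location l_stonith_laplace p_stonith_laplace -inf: laplace
property stonith-enabled=true

The equivalent resource for ries, pinned away from ries, completes the pair.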
Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes
On 03/15/2012 11:50 PM, William Seligman wrote: On 3/15/12 6:07 PM, William Seligman wrote: On 3/15/12 6:05 PM, William Seligman wrote: On 3/15/12 4:57 PM, emmanuel segura wrote: we can try to understand what happen when clvm hang edit the /etc/lvm/lvm.conf and change level = 7 in the log session and uncomment this line file = /var/log/lvm2.log Here's the tail end of the file (the original is 1.6M). Because there no times in the log, it's hard for me to point you to the point where I crashed the other system. I think (though I'm not sure) that the crash happened after the last occurrence of cache/lvmcache.c:1484 Wiping internal VG cache Honestly, it looks like a wall of text to me. Does it suggest anything to you? Maybe it would help if I included the link to the pastebin where I put the output: http://pastebin.com/8pgW3Muw Could the problem be with lvm+drbd? In lvm2.conf, I see this sequence of lines pre-crash: device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes device/dev-io.c:588 Closed /dev/drbd0 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0, get some info, close. Post-crash, I see: evice/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:271 /dev/md0: size is 1027968 sectors device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes device/dev-io.c:588 Closed /dev/md0 filters/filter-composite.c:31 Using /dev/md0 device/dev-io.c:535 Opened /dev/md0 RO O_DIRECT device/dev-io.c:137 /dev/md0: block size is 1024 bytes label/label.c:186 /dev/md0: No label detected device/dev-io.c:588 Closed /dev/md0 device/dev-io.c:535 Opened /dev/drbd0 RO O_DIRECT device/dev-io.c:271 /dev/drbd0: size is 5611549368 sectors device/dev-io.c:137 /dev/drbd0: block size is 4096 bytes ... and then it hangs. Comparing the two, it looks like it can't close /dev/drbd0. If I look at /proc/drbd when I crash one node, I see this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s- ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 s- ... DRBD suspended io, most likely because of it's fencing-policy. For valid dual-primary setups you have to use resource-and-stonith policy and a working fence-peer handler. In this mode I/O is suspended until fencing of peer was succesful. 
Question is, why the peer does _not_ also suspend its I/O because obviously fencing was not successful . So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on drbd replication link). You can use e.g. that nice fencing script: http://goo.gl/O4N8f Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now If I look at /proc/drbd if I bring down one node gracefully (crm node standby), I get this: # cat /proc/drbd version: 8.3.12 (api:88/proto:86-96) GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r- ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0 Could it be that drbd can't respond to certain requests from lvm if the state of the peer is DUnknown instead of Outdated? Il giorno 15 marzo 2012 20:50, William Seligman selig...@nevis.columbia.edu ha scritto: On 3/15/12 12:55 PM, emmanuel segura wrote: I don't see any error and the answer for your question it's yes can you show me your /etc/cluster/cluster.conf and your crm configure show
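On the DRBD side, the policy Andreas refers to usually looks roughly like the snippet below; this is a sketch that reuses the handler scripts already shown in the drbd.conf earlier in this archive, and the script behind the shortened link can be substituted for the fence-peer handler:

disk {
    fencing resource-and-stonith;
}
handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
}

With resource-and-stonith, I/O stays suspended until the fence-peer handler reports that the peer was successfully fenced, which is the behaviour Andreas describes as required for a valid dual-primary setup.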
Re: [Linux-HA] stonith/fence using external/libvirt on KVM
On 02/23/2012 02:59 PM, Tom Hanstra wrote: Hmmm, this is something which I did not understand when starting to look into this. If this is the case, it would be nice if the web pages were updated accordingly. You mean linux-ha.com? ... yeah that might be true. But looking at clusterlabs.org makes it quite clear, that corosync is the way to go for new setups ... there are also some nice faqs: http://clusterlabs.org/wiki/FAQ Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com Tom On 02/22/2012 05:48 PM, Andreas Kurz wrote: Since heartbeat is not actively developed any more, corosync is the way to go for a future proof setup. signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] stonith/fence using external/libvirt on KVM
Hello Tom, On 02/22/2012 08:01 PM, Tom Hanstra wrote: I'm new to the Linux-HA clustering, though I've had experience with RedHat's Cluster packages for several years. I'm trying to see how the open source software compares. So, I set up two KVM Virtual servers running RHEL6 and compiled and installed the Cluster Glue, Heartbeat, and Pacemaker software. I was able to get two nodes running, though there are some errors which I will need to track down. Oh ... why did you build the complete stack manually? Pacemaker is technology preview in RHEL6 and it ships latest version in combination with corosync instead of Heartbeat this works really fine. From my other cluster experience, I know that getting fencing/stonith set up properly is something necessary and I want to work on that even before I try to track down other problems further. Without the ability to kill off a node, odd things can happen. So, my focus right now is on finding a working stonith device for this setup. I got all of the pieces I think I need for the external/libvirt device, have fence_virtd running on the host box and I do get output on both host and clients from the fence_xvm command: 1023$ fence_xvm -o list RH5_LIS0 25132742-8e3a-a1f2-a862-de3705ea8d8f on RH5_LIS1 2b4d4813-0107-6aec-a66f-2159ec95da4c on RH5_LIS2 fa6e2603-f7d6-34fa-dd03-4886cdf6e44b on RH5_LIS3 aafc6639-2d29-8bbe-4d62-38498f390563 on RH6_WITS7 51c15635-889f-7213-b0e2-e213f771e52a on RH6_WITS8 2ffdfebe-d49b-698b-b76c-a4abd8cbf42a on Where I am running into problems right now is translating the information I have from this command into the proper setup and syntax to set this as a stonith device and actually test killing off a node. The information given by this command gives the names of the virtual machines. But in my cluster setup, I have given these node names: lv7-eli = RH6_WITS7 lv8-eli = RH6_WITS8 Try something like that for a single host setup: primitive stonith_lv7-eli stonith:fence_virt \ params pcmk_host_check=static-list \ pcmk_host_list=lv7-eli \ port=RH6_WITS7 \ op monitor interval=600s ... and the same for the other node with adopted names. You should also take care to run the stonith resources not on that node that can be fenced by it ... like: location l_stonith_lv7-eli stonith_lv7-eli -inf: lv7-eli What is the proper stonith command that will actually kill off a node in such a KVM setup? And how does that translate into settings I would add to my ha.cf file? Even if you continue to use Heartbeat ccm instead of corosync, there is nothing to be added to ha.cf, all stonith resource configuration is done in the cib. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/services/remote Thanks, Tom Hanstra t...@nd.edu ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
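Spelling out the "same for the other node" part, the second fencing resource would mirror Andreas' example with the names from the mapping above (a sketch):

primitive stonith_lv8-eli stonith:fence_virt \
    params pcmk_host_check=static-list \
    pcmk_host_list=lv8-eli \
    port=RH6_WITS8 \
    op monitor interval=600s
location l_stonith_lv8-eli stonith_lv8-eli -inf: lv8-eli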
Re: [Linux-HA] stonith/fence using external/libvirt on KVM
Hello, On 02/22/2012 11:37 PM, Tom Hanstra wrote: See my further information with TH below... On 02/22/2012 05:09 PM, Andreas Kurz wrote: Hello Tom, On 02/22/2012 08:01 PM, Tom Hanstra wrote: I'm new to the Linux-HA clustering, though I've had experience with RedHat's Cluster packages for several years. I'm trying to see how the open source software compares. So, I set up two KVM Virtual servers running RHEL6 and compiled and installed the Cluster Glue, Heartbeat, and Pacemaker software. I was able to get two nodes running, though there are some errors which I will need to track down. Oh ... why did you build the complete stack manually? Pacemaker is technology preview in RHEL6 and it ships latest version in combination with corosync instead of Heartbeat this works really fine. TH Unfortunately, I'm limited to the educational version of RHEL6 which does not include any of the clustering software without additional charges. I just did a check on both corosync and pacemaker. For corosync, the packages show up but are inaccessible; for pacemaker, only pacemaker-cts is available. I'm not sure if this is sufficient but doubt it. I see ... well, Centos and Scientific Linux have all packages in their repos ... But is corosync better than heartbeat? Or am I getting into a religious war by asking that? Since heartbeat is not actively developed any more, corosync is the way to go for a future proof setup. From my other cluster experience, I know that getting fencing/stonith set up properly is something necessary and I want to work on that even before I try to track down other problems further. Without the ability to kill off a node, odd things can happen. So, my focus right now is on finding a working stonith device for this setup. I got all of the pieces I think I need for the external/libvirt device, have fence_virtd running on the host box and I do get output on both host and clients from the fence_xvm command: 1023$ fence_xvm -o list RH5_LIS0 25132742-8e3a-a1f2-a862-de3705ea8d8f on RH5_LIS1 2b4d4813-0107-6aec-a66f-2159ec95da4c on RH5_LIS2 fa6e2603-f7d6-34fa-dd03-4886cdf6e44b on RH5_LIS3 aafc6639-2d29-8bbe-4d62-38498f390563 on RH6_WITS7 51c15635-889f-7213-b0e2-e213f771e52a on RH6_WITS8 2ffdfebe-d49b-698b-b76c-a4abd8cbf42a on Where I am running into problems right now is translating the information I have from this command into the proper setup and syntax to set this as a stonith device and actually test killing off a node. The information given by this command gives the names of the virtual machines. But in my cluster setup, I have given these node names: lv7-eli = RH6_WITS7 lv8-eli = RH6_WITS8 Try something like that for a single host setup: primitive stonith_lv7-eli stonith:fence_virt \ params pcmk_host_check=static-list \ pcmk_host_list=lv7-eli \ port=RH6_WITS7 \ op monitor interval=600s TH Bear with me a bit. This is a crm configuration command, right. Can you help me understand where the information gets stored when I issue this command? I was thinking it would go to a file somewhere, but as you mention later, this information does not come from the ha.cf file. Where does it go? The cib.xml file is stored in /var/lib/heartbeat/crm directory and propagated to all nodes ... dont't manipulate it manually crm configure show gives you the crm syntax version which is much easier to read. ... and the same for the other node with adopted names. You should also take care to run the stonith resources not on that node that can be fenced by it ... 
like: location l_stonith_lv7-eli stonith_lv7-eli -inf: lv7-eli TH I'm not clear on what you mean here. This another configuration command, but I don't understand what it is doing. In my two node cluster, each node should be able to fence off the other. How does this command help to accomplish that? This is a location constraints that disallows the stonith resource capable of fencing lv7-eli to run on node lv7-eli. What is the proper stonith command that will actually kill off a node in such a KVM setup? And how does that translate into settings I would add to my ha.cf file? Even if you continue to use Heartbeat ccm instead of corosync, there is nothing to be added to ha.cf, all stonith resource configuration is done in the cib. Regards, Andreas Thanks for your help, You are welcome! Tom Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/services/custom-training signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Apache wont start on VIP
Hello, On 02/05/2012 01:57 AM, mike wrote: Hi all, I've got a very simple 2 node setup. It runs a few VIPs and uses ldirectord to load balance. That part works perfectly. On the same 2 node cluster I have apache running and it fails back and forth fine as long as ports.conf is set to listen on all ip's. I do have a VIP - 192.168.2.2 that I want Apache to start up on. I have tested apache manually on both nodes by editing ports.conf and setting it to 192.168.2.2 and starting apache from the command line - works fine. When Please share your config if you want authoritative answers ... without any further information it looks like a missing order constraint between apache and its IP. You should find logging output from apache RA in your syslogs Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now I try to get Apache to start with HA it fails. Here is what my current set up looks like: Last updated: Sat Feb 4 20:50:25 2012 Stack: Heartbeat Current DC: firethorn (3125c95a-33d1-4923-a5c0-38b228f90ecf) - partition with quorum Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b 2 Nodes configured, unknown expected votes 3 Resources configured. Online: [ firethorn vanderbilt ] Resource Group: Web_Cluster_IP ApacheIP (ocf::heartbeat:IPaddr2): Started firethorn Resource Group: Web_Cluster WebSite(ocf::heartbeat:apache):Started firethorn ClusterIP (ocf::heartbeat:IPaddr2): Started firethorn ClusterIP2 (ocf::heartbeat:IPaddr2): Started firethorn Resource Group: LVS_Cluster LdirectorIP(ocf::heartbeat:IPaddr2): Started firethorn ldirectord (ocf::heartbeat:ldirectord):Started firethorn ApacheIP starts up first before apache does so I'm at a loss to understand why HA has issues starting apache. The logs are not revealing unfortunately. Initially I had the ApacheIP and apache in the same resource group so I thought maybe if I break the IP out into its own group then I'd know it comes up first and apache should follow. It doesn't sadly. So am I missing something obvious here? Why does apache start from the command line but not in HA unless ports.conf is set to listen on all interfaces? Thanks -mike ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
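Given the resource names in the status output above, the missing piece Andreas hints at would be constraints that keep Apache with its VIP and start the VIP first; a sketch between the two existing groups (alternatively, put ApacheIP and WebSite into one group with the IP listed first):

colocation col_web_with_ip inf: Web_Cluster Web_Cluster_IP
order ord_ip_before_web inf: Web_Cluster_IP Web_Cluster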
Re: [Linux-HA] Status about ocfs2.pcmk ?
Hello, On 02/03/2012 09:29 AM, alain.mou...@bull.net wrote: Hi Andreas , thanks for your response, but two questions : 1/ why going with GFS2 ? because you know that ocfs2+pacemaker still does not work fine on rhel ? or ... ? Because GFS2 is actively developed mostly by Redhat including the parts needed to glue it to Pacemaker and there have been some threads on the Pacemaker ML. I know it works with SLES11 SP1 with the packages shipped in the HA extension and also latest Debian/Ubuntu Packages should work. 2/ you're right GFS2 is working much better with pacemaker than OCFS2, but the problem is that GFS2 was about 10 times less efficient with regard to IO benchs than OCFS2 ! Never compared them by myself, I try to avoid using cluster file systems. What is your use case? Is this status has changed since 2010 ? I dont' think so when watching all the messages on Mailing List ... but I'm not sure .. I have the same impression, yes. Regards, Andreas Thanks Alain De :Andreas Kurz andr...@hastexo.com A : linux-ha@lists.linux-ha.org Date : 02/02/2012 15:47 Objet : Re: [Linux-HA] Status about ocfs2.pcmk ? Envoyé par :linux-ha-boun...@lists.linux-ha.org On 02/02/2012 02:54 PM, alain.mou...@bull.net wrote: Hi Just wonder if someone has succeded to configured a working HA configuration with Pacemaker/corosync and OCFS2 file systems, meaning using ocfs2.pcmk , on RHEL6 mainly (and eventually SLES11) ? For RHEL6 I'd recommend you go with GFS2 and follow the Cluster from Scratch documentation ... I know OCFS2 on SLES11 SP1 is working fine in a Pacemaker cluster. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Status about ocfs2.pcmk ?
On 02/02/2012 02:54 PM, alain.mou...@bull.net wrote: Hi Just wonder if someone has succeded to configured a working HA configuration with Pacemaker/corosync and OCFS2 file systems, meaning using ocfs2.pcmk , on RHEL6 mainly (and eventually SLES11) ? For RHEL6 I'd recommend you go with GFS2 and follow the Cluster from Scratch documentation ... I know OCFS2 on SLES11 SP1 is working fine in a Pacemaker cluster. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now (I tried at the end of 2010 but gave up after a few weeks because it was not working at all) Thanks if someone can give a status? Regards Alain Moullé ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together
Hello, On 01/13/2012 08:39 AM, Niclas Müller wrote: Is it necessary to put services like drbd, apache or mysql into pacemaker as a resource ? It worked without that, but is it better to add this as a service ? If you want them to be highly available, means monitored and automatically started/stopped when needed then yes, add it as a service. Pacemaker in combination with STONITH can also help you in split-brain scenarios. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now On 01/13/2012 01:59 AM, Andreas Kurz wrote: Hello, On 01/12/2012 11:29 PM, Niclas Müller wrote: That the VirtualIP isn't shown by 'ifconfig -a' is realy nice, because i made my failed search on this because of this howto : http://www.howtoforge.com/high_availability_loadbalanced_apache_cluster You follow a howto from the year 2006? ... anyway, ifconfig would show the IP because you used IPaddr and not IPaddr2 RA Im going to understand now all. I've configurated a resource for failover-ip from your link, but get this error page. Pacemaker cannot start resource failover-ip of course you already have an interface up in this network? The resource agent only adds secondary addresses And you should find most logging information on your DC - node2 in /var/log/{syslog,daemon.log) Regards, Andreas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together
On 01/13/2012 10:34 AM, Niclas Müller wrote: I've got apache and failover-ip running on the cluster. I'm hanging on the problem that heartbeat starts apache on node1 and failover-ip on node2 if both of them are running. Only if one node is offline are both resources running on the online node. How can I tell pacemaker to run these two resources on the same node every time? you need constraints ... don't stop reading documentation ;-) Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now On 01/13/2012 08:39 AM, Niclas Müller wrote: Is it necessary to put services like drbd, apache or mysql into pacemaker as a resource ? It worked without that, but is it better to add this as a service ? On 01/13/2012 01:59 AM, Andreas Kurz wrote: Hello, On 01/12/2012 11:29 PM, Niclas Müller wrote: That the VirtualIP isn't shown by 'ifconfig -a' is realy nice, because i made my failed search on this because of this howto : http://www.howtoforge.com/high_availability_loadbalanced_apache_cluster You follow a howto from the year 2006? ... anyway, ifconfig would show the IP because you used IPaddr and not IPaddr2 RA Im going to understand now all. I've configurated a resource for failover-ip from your link, but get this error page. Pacemaker cannot start resource failover-ip of course you already have an interface up in this network? The resource agent only adds secondary addresses And you should find most logging information on your DC - node2 in /var/log/{syslog,daemon.log) Regards, Andreas
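For this particular pair of resources the constraints Andreas points to would be roughly (a sketch; it assumes the Apache resource is simply named apache, as in the mails):

colocation apache_with_ip inf: apache failover-ip
order ip_before_apache inf: failover-ip apache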
Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together
On 01/12/2012 10:22 PM, Niclas Müller wrote: I'm currently going to setup a Linux HA Cluster with apache and MySQL. I've created three VM with KVM Vitalization. One NetworkManager as DNS and DHCP Server, and two other as Cluster Nodes. All VMs are both Debian Squeeze minimal installations. On the Nodes i've installed the packages heartbeat and pacemaker. The configuration of heartbeat seems like correct because in the syslog there are no errors and i can read that the nodes have contact. My first impression of the software packages is, that heartbeat is only for checking the Nodes of ability. Pacemaker is for the services which are to manager. I cannot get it on that heartbeat / pacemaker (??) use the virtual ip. In both interfaces there is no eth0:0 with the configurated ip address. Is the virtual ip used by the primary node only when a service is confiugrated? Have anybody a good howto for set up a apache and mysql cluster with heartbeat and pacemaker ? yes, Heartbeat is the CCM (cluster consensus and membership) layer and Pacemaker relies on it to get valid information about node health and uses it to transfer messages/updates to the other nodes. use crm_mon -1fr to see the current state of your cluster resources you already?? configured in Pacemaker. if you use IPaddr2 RA you have to use ip addr show to see the virtual IPs. there are a lot of howtos available, maybe you want to start at: http://www.clusterlabs.org/wiki/Documentation#Examples Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thank's Niclas P.S. : A user have send me a mail that with the mail client i send the question before made a mistake with my FORM Field, hopefully now away! ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
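As a concrete example of the two commands Andreas mentions (the interface name is an assumption):

crm_mon -1fr        # one-shot view of the cluster including failcounts and inactive resources
ip addr show eth0   # virtual IPs added by IPaddr2 appear here as secondary addresses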
Re: [Linux-HA] Single Point of Failure
On 01/13/2012 12:22 AM, Paul O'Rorke wrote: hmmm - it looks like I may have to re-evaluate this. Geographic redundency is the point of this exercise, our office is in a location that has is less than ideal history for power reliability. We are a small software company and rely on email for online sales and product delivery so our solution - what ever it be - must allow for one location to completely lose power and still deliver client emails. as Dimitri says ... you really want to have a look at Google apps for business ... Regards, Andreas Mail is a very complex subject and I must confess that the excellent suggestions made here may be a little more than I was prepared to dive into. Given that this is a HA-Linux list, and that if I understand this correctly it is not really designed for multi-site clusters, can anyone suggest a more suitable technology? (the server is running CentOS/Exim) Or perhaps I should be doing the grunt work and trying out some of the above suggestions... I do appreciate the excellent feedback to date! thanks On Thu, Jan 12, 2012 at 1:53 PM, Arnold Krille arn...@arnoldarts.de wrote: On Thursday 12 January 2012 22:14:41 Jakob Curdes wrote: Miles Fidelman wrote: - you can set up a 2ndry server (give it an MX record with lower priority than the primary server) - it will receive mail when the primary goes down; and you can set up the mail config to forward stuff automatically to the primary server when it comes back up -- people won't be able to get to their mail until the primary comes back up, but mail will get accepted and will eventually get delivered Just one additional note: in such a setup, you should not assume that the secondary server only receives mail when the first one is down from your side of view. A client somewhere might have a different connectivity view and might deliver mail to your secondary MX at any time. It is well-known that spammer systems even try to deliver to the secondary in the hope that protection there is lower. So, if you have a secondary, you must arrange for mail delivered to that server to be passed on to the primary or a separate backend server. And you need to protect it exactly as good as your primary against virus, spam, and DOS attacks. So: If you go through the hazzles to set up a second receiving host with the same quality and administration requirements as the first one, you will also want to reflect that by giving it an equally high score in the mx field. That way both servers will be used equally and you get load-balancing where you originally meant to buy hot-standby:-) Another comment from here: Email is such an old protocol that the immunity to network errors was built in. If a sending host can't reach the receiver, it will try again after some time. And then again and again until a timeout is reached. And that timeout is not 2-4 seconds like with many tcp-based protocols but 4 days giving the admins the chance on monday to fix the mailserver that crashed on friday evening. Of course, if you rely on fast mail for your business, the price of redundant smtp and redundant pop3/imap servers might pay off. For redundant pop3/imap the cyrus project (and probably the other too) seem to have a special daemon to sync mails and mail-actions across servers. Add a redundant master-slave replicating mysql (or postgres) for the account database or even ldap and you should get something that even scales beyond 2 machine. Completely off-topic for this list as I haven't thrown in any heartbeat, pacemaker, corosync or drbd at this point. 
Have fun, Arnold -- Need help with Pacemaker? http://www.hastexo.com/now
Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together
Hello, On 01/12/2012 11:29 PM, Niclas Müller wrote: That the VirtualIP isn't shown by 'ifconfig -a' is realy nice, because i made my failed search on this because of this howto : http://www.howtoforge.com/high_availability_loadbalanced_apache_cluster You follow a howto from the year 2006? ... anyway, ifconfig would show the IP because you used IPaddr and not IPaddr2 RA Im going to understand now all. I've configurated a resource for failover-ip from your link, but get this error page. Pacemaker cannot start resource failover-ip of course you already have an interface up in this network? The resource agent only adds secondary addresses And you should find most logging information on your DC - node2 in /var/log/{syslog,daemon.log) Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now root@node1:/etc/ha.d# crm_mon -1fr Last updated: Thu Jan 12 17:25:43 2012 Stack: Heartbeat Current DC: node2 (40dbaed1-9618-41f0-acbe-c3f0f6334cce) - partition with quorum Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b 2 Nodes configured, unknown expected votes 1 Resources configured. Online: [ node1 node2 ] Full list of resources: failover-ip(ocf::heartbeat:IPaddr):Stopped Migration summary: * Node node1: failover-ip: migration-threshold=100 fail-count=100 * Node node2: failover-ip: migration-threshold=100 fail-count=100 Failed actions: failover-ip_start_0 (node=node1, call=4, rc=1, status=complete): unknown error failover-ip_start_0 (node=node2, call=4, rc=1, status=complete): unknown error node $id=40dbaed1-9618-41f0-acbe-c3f0f6334cce node2 node $id=6c62b04f-0d3a-4bc5-a084-ffba618a8e87 node1 primitive failover-ip ocf:heartbeat:IPaddr \ params ip=192.168.0.100 \ op monitor interval=10s \ meta target-role=Started property $id=cib-bootstrap-options \ dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \ cluster-infrastructure=Heartbeat \ stonith-enabled=false Thank'a lot, the switch in my brain make PUK and now it seems all clear how this work together. Syslog doesn't show any errors about the failover-ip resource. Ideas ? On 01/12/2012 10:55 PM, Andreas Kurz wrote: On 01/12/2012 10:22 PM, Niclas Müller wrote: I'm currently going to setup a Linux HA Cluster with apache and MySQL. I've created three VM with KVM Vitalization. One NetworkManager as DNS and DHCP Server, and two other as Cluster Nodes. All VMs are both Debian Squeeze minimal installations. On the Nodes i've installed the packages heartbeat and pacemaker. The configuration of heartbeat seems like correct because in the syslog there are no errors and i can read that the nodes have contact. My first impression of the software packages is, that heartbeat is only for checking the Nodes of ability. Pacemaker is for the services which are to manager. I cannot get it on that heartbeat / pacemaker (??) use the virtual ip. In both interfaces there is no eth0:0 with the configurated ip address. Is the virtual ip used by the primary node only when a service is confiugrated? Have anybody a good howto for set up a apache and mysql cluster with heartbeat and pacemaker ? yes, Heartbeat is the CCM (cluster consensus and membership) layer and Pacemaker relies on it to get valid information about node health and uses it to transfer messages/updates to the other nodes. use crm_mon -1fr to see the current state of your cluster resources you already?? configured in Pacemaker. if you use IPaddr2 RA you have to use ip addr show to see the virtual IPs. 
there are a lot of howtos available, maybe you want to start at: http://www.clusterlabs.org/wiki/Documentation#Examples Regards, Andreas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
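One way to act on the hints above is to switch the resource to the IPaddr2 agent, pin it to an interface that is already up in that network, and clear the accumulated failures afterwards; a sketch with the address from the thread (the netmask and NIC are assumptions):

crm configure delete failover-ip
crm configure primitive failover-ip ocf:heartbeat:IPaddr2 \
    params ip=192.168.0.100 cidr_netmask=24 nic=eth0 \
    op monitor interval=10s
crm resource cleanup failover-ip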
Re: [Linux-HA] if promote runs into timeout
Hello, On 01/06/2012 06:14 PM, erkan yanar wrote: Moin, Im having the issue, that promoting a master can run into the promote timeout. After that, the resource is stopped and started as a slave. In my example it is a mysql resource, where promoting is going to wait for any replication lag to be applied. This could last a very long time. If the resource is not ready be promoted it should have no promotion score ... is this the mysql RA coming with the resource-agents package or a home-grown RA? There are some thoughts on that issue: 1. Dynamically increase the timeout with cibadmin. I havent tested that yet. Would this work? 2. op-fail=ignore With ignore, the resource is not restarted. But I don't like that approach. Is there an intelligent approach to dynamically change the timeout while promoting? Or is there a better approach anyway? Even if it would be ok to promote such an instance, why not increasing the promote timeout to a fixed safe value? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Regards Erkan signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
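If the promotion may legitimately take very long (waiting until the replication lag is applied), the fixed-value approach Andreas suggests is simply a generous promote timeout on the resource; a sketch with the operation added to the mysql primitive (the parameter list is abbreviated and the timeout value is an assumption that has to cover the worst expected lag):

primitive p_mysql ocf:heartbeat:mysql \
    params binary=/usr/sbin/mysqld datadir=/var/lib/mysql config=/etc/my.cnf \
    op monitor interval=15s timeout=30s \
    op promote interval=0 timeout=3600s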
Re: [Linux-HA] DRBD Split brain question
Hello, On 12/20/2011 02:47 PM, Ulrich Windl wrote: Hi! I have a dual-primary DRBD that is not working well: It was working, then I shut it down and restarted it. DRBD complained about split brain and fenced the other node. When coming up, the other node fenced this node. IMHO no node should have fenced each other. no config from drbd, no cluster config, partial/filtered logs ... fragments ... you have _all_ information and can't find the problem ... sorry, but I can't see how anyone can help you based on that information. I personally think it is part of the free community support deal to share as much information as possible if one wants help for free. Regards, Andreas -- Need help with Pacemaker or DRBD? http://www.hastexo.com/now Here are the logs from both nodes, restricted to DRBD: Dec 20 14:22:01 h06 kernel: [339936.743323] block drbd0: Starting worker thread (from cqueue [13353]) Dec 20 14:22:01 h06 kernel: [339936.743452] block drbd0: disk( Diskless - Attaching ) Dec 20 14:22:01 h06 kernel: [339936.767174] block drbd0: Found 4 transactions (6 active extents) in activity log. Dec 20 14:22:01 h06 kernel: [339936.767178] block drbd0: Method to ensure write ordering: barrier Dec 20 14:22:01 h06 kernel: [339936.767185] block drbd0: drbd_bm_resize called with capacity == 1048472 Dec 20 14:22:01 h06 kernel: [339936.767194] block drbd0: resync bitmap: bits=131059 words=2048 pages=4 Dec 20 14:22:01 h06 kernel: [339936.767197] block drbd0: size = 512 MB (524236 KB) Dec 20 14:22:01 h06 kernel: [339936.773015] block drbd0: bitmap READ of 4 pages took 2 jiffies Dec 20 14:22:01 h06 kernel: [339936.773032] block drbd0: recounting of set bits took additional 0 jiffies Dec 20 14:22:01 h06 kernel: [339936.773035] block drbd0: 0 KB (0 bits) marked out-of-sync by on disk bit-map. 
Dec 20 14:22:01 h06 kernel: [339936.773041] block drbd0: disk( Attaching - UpToDate ) Dec 20 14:22:01 h06 kernel: [339936.773045] block drbd0: attached to UUIDs 8344B9D0C389D2DC::902F198E803AB8E3:902E198E803AB8E3 Dec 20 14:22:01 h06 kernel: [339936.795343] block drbd0: conn( StandAlone - Unconnected ) Dec 20 14:22:01 h06 kernel: [339936.795395] block drbd0: Starting receiver thread (from drbd0_worker [10322]) Dec 20 14:22:01 h06 kernel: [339936.795452] block drbd0: receiver (re)started Dec 20 14:22:01 h06 kernel: [339936.795458] block drbd0: conn( Unconnected - WFConnection ) Dec 20 14:22:02 h06 kernel: [339937.490329] block drbd0: role( Secondary - Primary ) Dec 20 14:22:02 h06 kernel: [339937.490583] block drbd0: new current UUID B95131C56A7C2935:8344B9D0C389D2DC:902F198E803AB8E3:902E198E803AB8E3 Dec 20 14:22:02 h06 multipathd: drbd0: update path write_protect to '0' (uevent) Dec 20 14:22:02 h06 kernel: [339937.537270] block drbd0: Handshake successful: Agreed network protocol version 96 Dec 20 14:22:02 h06 kernel: [339937.537278] block drbd0: conn( WFConnection - WFReportParams ) Dec 20 14:22:02 h06 kernel: [339937.537335] block drbd0: Starting asender thread (from drbd0_receiver [10344]) Dec 20 14:22:02 h06 kernel: [339937.537725] block drbd0: data-integrity-alg: not-used Dec 20 14:22:02 h06 kernel: [339937.543391] block drbd0: drbd_sync_handshake: Dec 20 14:22:02 h06 kernel: [339937.543394] block drbd0: self B95131C56A7C2935:8344B9D0C389D2DC:902F198E803AB8E3:902E198E803AB8E3 bits:0 flags: 0 Dec 20 14:22:02 h06 kernel: [339937.543397] block drbd0: peer 3778E40F06BD4779:8344B9D0C389D2DC:902F198E803AB8E2:902E198E803AB8E3 bits:0 flags: 0 Dec 20 14:22:02 h06 kernel: [339937.543399] block drbd0: uuid_compare()=100 by rule 90 Dec 20 14:22:02 h06 kernel: [339937.543403] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 Dec 20 14:22:02 h06 kernel: [339937.546011] block drbd0: helper command: /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0) Dec 20 14:22:02 h06 kernel: [339937.546015] block drbd0: Split-Brain detected but unresolved, dropping connection! Dec 20 14:22:02 h06 kernel: [339937.546018] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 Dec 20 14:22:02 h06 kernel: [339937.551050] block drbd0: meta connection shut down by peer. Dec 20 14:22:02 h06 kernel: [339937.551056] block drbd0: conn( WFReportParams - NetworkFailure ) Dec 20 14:22:02 h06 kernel: [339937.551065] block drbd0: asender terminated Dec 20 14:22:02 h06 kernel: [339937.551067] block drbd0: Terminating asender thread Dec 20 14:22:02 h06 kernel: [339937.586136] block drbd0: helper command: /sbin/drbdadm split-brain minor-0 exit code 0 (0x0) Dec 20 14:22:02 h06 kernel: [339937.586146] block drbd0: conn( NetworkFailure - Disconnecting ) Dec 20 14:22:02 h06 kernel: [339937.586152] block drbd0: error receiving ReportState, l: 4! Dec 20 14:22:02 h06 kernel: [339937.586211] block drbd0: Connection closed Dec 20 14:22:02 h06 kernel: [339937.586217] block drbd0: conn( Disconnecting -
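For reference, DRBD can also be told to resolve a detected split brain on its own instead of just dropping the connection as in the log above; a sketch of the usual 8.3-style policies, which can throw away changes on the node picked as the victim and is therefore only an option if that is acceptable:

net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
}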
Re: [Linux-HA] Antw: Re: OCFS on top of dual-primary DRBD in SLES11 SP1
On 12/19/2011 09:15 AM, Ulrich Windl wrote: Andreas Kurz andr...@hastexo.com schrieb am 16.12.2011 um 14:01 in Nachricht 4eeb412e.9010...@hastexo.com: Hello Ulrich, On 12/16/2011 01:31 PM, Ulrich Windl wrote: Hi! I have some troubel with OCFS on top of DRBD that seems to be timing-related: OCFS is working on the DRBD when DRBD itself wants to vhange something it seems: can we see your cib and your full drbd cofniguration please ... It's somewhat complex, and I may not show you everything, sorry for that. no problem ... you asked for help on a public mailing-list ... ... Dec 16 11:39:58 h06 kernel: [ 122.426174] block drbd0: role( Secondary - Primary ) Dec 16 11:39:58 h06 multipathd: drbd0: update path write_protect to '0' (uevent) Dec 16 11:40:29 h06 ocfs2_controld: start_mount: uuid FD32E504527742CEA7DA6DB272D5D7B2, device /dev/drbd_r0, service ocfs2 ... Dec 16 11:40:29 h06 kernel: [ 152.837615] block drbd0: peer( Secondary - Primary ) Dec 16 11:40:29 h06 ocfs2_hb_ctl[19177]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -P -d /dev/drbd_r0 Dec 16 11:43:50 h06 kernel: [ 354.559240] block drbd0: State change failed: Device is held open by someone Dec 16 11:43:50 h06 kernel: [ 354.559244] block drbd0: state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r- } Dec 16 11:43:50 h06 kernel: [ 354.559246] block drbd0: wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r- } Dec 16 11:43:50 h06 drbd[28754]: [28786]: ERROR: r0: Called drbdadm -c /etc/drbd.conf secondary r0 Dec 16 11:43:50 h06 drbd[28754]: [28789]: ERROR: r0: Exit code 11 A little bit later DRBD did it's own fencing (the machine rebooted) do you have logs to confirm this? Naturally no, as the commands echo b /proc/sysrq-trigger ; reboot -f don't actually write nice log messages. All those nice drbd notify scripts do send mails, at least to local root account. Additionally they try to log via syslog as well as DRBD does on executing the handler ... so you have a good chance to get some information if DRBD triggers that reboot ... at least if you are doing remote syslogging. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] OCFS on top of dual-primary DRBD in SLES11 SP1
Hello Ulrich, On 12/16/2011 01:31 PM, Ulrich Windl wrote: Hi! I have some troubel with OCFS on top of DRBD that seems to be timing-related: OCFS is working on the DRBD when DRBD itself wants to vhange something it seems: can we see your cib and your full drbd cofniguration please ... ... Dec 16 11:39:58 h06 kernel: [ 122.426174] block drbd0: role( Secondary - Primary ) Dec 16 11:39:58 h06 multipathd: drbd0: update path write_protect to '0' (uevent) Dec 16 11:40:29 h06 ocfs2_controld: start_mount: uuid FD32E504527742CEA7DA6DB272D5D7B2, device /dev/drbd_r0, service ocfs2 ... Dec 16 11:40:29 h06 kernel: [ 152.837615] block drbd0: peer( Secondary - Primary ) Dec 16 11:40:29 h06 ocfs2_hb_ctl[19177]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -P -d /dev/drbd_r0 Dec 16 11:43:50 h06 kernel: [ 354.559240] block drbd0: State change failed: Device is held open by someone Dec 16 11:43:50 h06 kernel: [ 354.559244] block drbd0: state = { cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r- } Dec 16 11:43:50 h06 kernel: [ 354.559246] block drbd0: wanted = { cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r- } Dec 16 11:43:50 h06 drbd[28754]: [28786]: ERROR: r0: Called drbdadm -c /etc/drbd.conf secondary r0 Dec 16 11:43:50 h06 drbd[28754]: [28789]: ERROR: r0: Exit code 11 A little bit later DRBD did it's own fencing (the machine rebooted) do you have logs to confirm this? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Is there a way to let the cluster do the fencing instead of writing to sysctl? Those handlers are used: handlers { pri-on-incon-degr /usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b /proc/sysrq-trigger ; reboot -f; pri-lost-after-sb /usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b /proc/sysrq-trigger ; reboot -f; local-io-error /usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o /proc/sysrq-trigger ; halt -f; Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] disconnecting network of any node cause both nodes fenced
On 12/05/2011 12:21 PM, Muhammad Sharfuddin wrote: On Sun, 2011-12-04 at 23:47 +0100, Andreas Kurz wrote: Hello, On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote: This cluster reboots (fences) both nodes if I disconnect the network of either node (simulating a network failure). Complete loss of the network is, for a cluster node, indistinguishable from a dead peer. I want that if any node disconnects from the network, resources running on that node should be moved/migrated to the other node (the network-connected node). Use the ping RA for connectivity checks and use location constraints to move resources according to network connectivity (to external ping targets). So does having a ping RA with an appropriate location rule at least make sure that if any one node loses network connectivity (i.e. both nodes can't see each other, while only one node is disconnected from the network), the other, healthy node (the network-connected node) won't reboot ... is that what you said? No ... in case one node loses its service network, resources can move to the other node if it has better connectivity. For this to work, the nodes still need an extra communication path. How can I prevent this cluster from rebooting (fencing) the healthy node (i.e. the node whose network is up/available/connected)? Multiple-failure scenarios are challenging and the possible solutions for a cluster are limited. With enough effort by an administrator every cluster can be tested to death. You can only minimize the possibility of a split-brain: * use redundant cluster communication paths (limited to two with corosync) In my test I disconnected every communication path of one node (and both rebooted). Did you clone the sbd resource? If yes, don't do that. Start it as a primitive, so in case of a split brain at least one node needs to start the stonith resource first, which should give the other node an advantage ... adding a start-delay should further increase that advantage. * at least one communication path is directly connected Will a directly connected communication path and a ping RA with a location rule prevent the reboot of the healthy (network-connected) node? No, don't use the other node as ping target ... that's ccm business ... directly connected networks are simply less error-prone than switched networks ... except for administrative interventions ;-) * use a quorum node I.e. I should add another node (a quorum node) to this two-node cluster? Yes ... you can add a node in permanent standby mode, or starting corosync without pacemaker should also work fine. If you are using a network-connected fencing device, use this network also for cluster communication. To prevent stonith death matches use power-off as the stonith action and/or don't start cluster services on system startup. The cluster does not start at system startup. Fine. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now -- Regards, Muhammad Sharfuddin
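Andreas' suggestion (sbd as a plain primitive rather than a clone, plus a start delay) would look roughly like this in crm syntax. A hedged sketch, reusing the device path already posted in this thread; the 15-second delay is only an illustration:

    primitive sbd_stonith stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/scsi-360080e50002377b802ff4e4bc873" \
        op start interval=0 timeout=120 start-delay=15s \
        op monitor interval=3000 timeout=120 \
        op stop interval=0 timeout=120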
Re: [Linux-HA] disconnecting network of any node cause both nodes fenced
Hello, On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote: This cluster reboots (fences) both nodes if I disconnect the network of either node (simulating a network failure). Complete loss of the network is, for a cluster node, indistinguishable from a dead peer. I want that if any node disconnects from the network, resources running on that node should be moved/migrated to the other node (the network-connected node). Use the ping RA for connectivity checks and use location constraints to move resources according to network connectivity (to external ping targets). How can I prevent this cluster from rebooting (fencing) the healthy node (i.e. the node whose network is up/available/connected)? Multiple-failure scenarios are challenging and the possible solutions for a cluster are limited. With enough effort by an administrator every cluster can be tested to death. You can only minimize the possibility of a split-brain: * use redundant cluster communication paths (limited to two with corosync) * at least one communication path is directly connected * use a quorum node If you are using a network-connected fencing device, use this network also for cluster communication. To prevent stonith death matches use power-off as the stonith action and/or don't start cluster services on system startup. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now I am using the following STONITH resource: primitive sbd_stonith stonith:external/sbd \ meta target-role=Started \ op monitor interval=3000 timeout=120 \ op start interval=0 timeout=120 \ op stop interval=0 timeout=120 \ params sbd_device=/dev/disk/by-id/scsi-360080e50002377b802ff4e4bc873 -- Regards, Muhammad Sharfuddin
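A hedged sketch of the ping-based placement Andreas describes, in crm syntax. The gateway address and the group name g_services are placeholders for this cluster's real objects; the location rule keeps resources off (or moves them away from) a node with no connectivity:

    primitive p_ping ocf:pacemaker:ping \
        params host_list="10.0.0.1" multiplier=1000 dampen=5s \
        op monitor interval=15s timeout=20s
    clone cln_ping p_ping meta globally-unique=false
    location loc_on_connected g_services \
        rule -inf: not_defined pingd or pingd lte 0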
Re: [Linux-HA] Light Weight Quorum Arbitration
Hello Eric, On 12/03/2011 02:36 PM, Robinson, Eric wrote: I have a geographically dispersed (stretch) cluster, where one node is in data center A and the other node is in data center B. I have done everything possible to ensure link redundancy between the cluster nodes. Each node has 4 x gigabit links connected to 4 different sets of switches and routers that connect the two data centers. The data centers are connected over two dual counter-rotating SONET rings. That said, the possibility remains that the links between the two data centers could be severed, leading to cluster partition. Is there a way to provide another quorum vote or something equivalent from a third location out on the Internet without having a full cluster node out there? Florian mentioned earlier that a full cluster node would probably not work well because of the bandwidth and latencies involved. What I really want is some kind of lightweight arbiter or quorum daemon at the third location. I've looked around but have not seen anything like that. Does anyone have any ideas? I've thought of trying to roll my own using ssh and shell scripts. the concept of an arbitrator for split-site cluster is already implemented and should be available with Pacemaker 1.1.6 though it seem to be not directly documented ... beside source code and this draft document: http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html I think this is exactly what you are looking for. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now -- Eric Robinson Disclaimer - December 3, 2011 This email and any files transmitted with it are confidential and intended solely for linux-ha@lists.linux-ha.org. If you are not the named addressee you should not disseminate, distribute, copy or alter this email. Any views or opinions presented in this email are solely those of the author and might not represent those of Physicians' Managed Care or Physician Select Management. Warning: Although Physicians' Managed Care or Physician Select Management has taken reasonable precautions to ensure no viruses are present in this email, the company cannot accept responsibility for any loss or damage arising from the use of this email or attachments. This disclaimer was added by Policy Patrol: http://www.policypatrol.com/ ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Custom resource agent script assistance
Hello Chris, On 12/01/2011 06:25 PM, Chris Bowlby wrote: Hi Everyone, I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster using the following packages: SLES 11 SP1, with Pacemaker 1.1.6, corosync 1.4.2, and drbd 8.3.12. I know about DHCP's internal fail-over abilities, but after testing, it simply failed to remain viable as a more robust HA type cluster. As such I began working on this solution. For reference my current configuration looks like this: node dhcp-vm01 \ attributes standby=off node dhcp-vm02 \ attributes standby=on primitive DHCPFS ocf:heartbeat:Filesystem \ params device=/dev/drbd1 directory=/var/lib/dhcp fstype=ext4 \ meta target-role=Started primitive dhcp-cluster ocf:heartbeat:IPaddr2 \ params ip=xxx.xxx.xxx.xxx cidr_netmask=32 \ op monitor interval=10s primitive dhcpd_service ocf:heartbeat:dhcpd \ params dhcpd_config=/etc/dhcpd.conf \ dhcpd_interface=eth0 \ op monitor interval=1min \ meta target-role=Started primitive dhcpdrbd ocf:linbit:drbd \ params drbd_resource=dhcpdata \ op monitor interval=60s ms DHCPData dhcpdrbd \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true colocation dhcpd_service-with_cluster_ip inf: dhcpd_service dhcp-cluster colocation fs_on_drbd inf: DHCPFS DHCPData:Master order DHCP-after-dhcpfs inf: DHCPFS:promote dhcpd_service:start order dhcpfs_after_dhcpdata inf: DHCPData:promote DHCPFS:start DHCPFS:promote ?? .. that action will never occour, so dhcpd_service will start whenever it likes ... typically not when it should ;-) ... remove that :promote ... and you miss a colocation between dhcpd_service and it's file system. I'd suggest using a group and colocate/order that with DRBD: group g_dhcp DHCPFS dhcpd_service dhcp-cluster .. or IP before dhcp if it needs to bind to it Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now property $id=cib-bootstrap-options \ dc-version=1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore rsc_defaults $id=rsc-options \ resource-stickiness=100 The floating IP works without issue, as does the DRBD integration such that if I put a node into standby, the IP, DRBD master/slave and FS mounts all transfer correctly. Only the DHCP component itself is failing, in that it wont start properly from within pacemaker. I suspect it is due to having to write a new script as I could not find an existing DHCPD RA agent anywhere. I built my own based off the development guide for resource agents on the wiki. I've managed to get it to complete all the tests I need it to pass in the ocf-tester script: ocf-tester -n dhcpd -o monitor_client_interface=eth0 /usr/lib/ocf/resource.d/heartbeat/dhcpd Beginning tests for /usr/lib/ocf/resource.d/heartbeat/dhcpd... * Your agent does not support the notify action (optional) * Your agent does not support the demote action (optional) * Your agent does not support the promote action (optional) * Your agent does not support master/slave (optional) /usr/lib/ocf/resource.d/heartbeat/dhcpd passed all tests Additionally if I run each of the various options (start/stop/monitor/validate-all/status/meta-data) at the command line, they all work with out issue, and stop/start the DHCPD process as expected. 
dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp root 12516 0.0 0.1 4344 756 pts/3S+ 17:16 0:00 grep dhcp dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # /usr/lib/ocf/resource.d/heartbeat/dhcpd start DEBUG: Validating the dhcpd binary exists. DEBUG: Validating that we are running in chrooted mode DEBUG: Chrooted mode is active, testing the chrooted path exists DEBUG: Checking to see if the /var/lib/dhcp//etc/dhcpd.conf exists and is readable DEBUG: Validating the dhcpd user exists DEBUG: Validation complete, everything looks good. DEBUG: Testing the state of the daemon itself DEBUG: OCF_NOT_RUNNING: 7 INFO: The dhcpd process is not running Internet Systems Consortium DHCP Server V3.1-ESV Copyright 2004-2010 Internet Systems Consortium. All rights reserved. For info, please visit https://www.isc.org/software/dhcp/ WARNING: Host declarations are global. They are not limited to the scope you declared them in. Not searching LDAP since ldap-server, ldap-port and ldap-base-dn were not specified in the config file Wrote 0 deleted host decls to leases file. Wrote 0 new dynamic host decls to leases file. Wrote 0 leases to leases file. Listening on LPF/eth0/00:0c:29:d7:64:99/SERVERS Sending on LPF/eth0/00:0c:29:d7:64:99/SERVERS Sending on Socket/fallback/fallback-net 0 INFO: dhcpd [chrooted] has started. DEBUG: Resource Agent Exit Status 0 DEBUG: default
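Tying Andreas' group suggestion together with the existing DRBD master/slave resource could look roughly like this. It is a sketch only, reusing the resource names from the configuration above, and it would replace the separate per-resource colocation and order constraints:

    group g_dhcp DHCPFS dhcpd_service dhcp-cluster
    colocation g_dhcp_on_drbd inf: g_dhcp DHCPData:Master
    order g_dhcp_after_drbd inf: DHCPData:promote g_dhcp:start

(Put dhcp-cluster before dhcpd_service inside the group if dhcpd needs to bind to the floating IP.)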
Re: [Linux-HA] Q: RA reload
On 11/30/2011 12:58 PM, Ulrich Windl wrote: Hi, when changing the performce-related-only mount option for a filesystem I noticed that the LRM decided to restart the resource and all the depending resources. As I know that Linux supports -o remount, such a restart would not be necessary. So I wonder: When ever will the LRM decide to try a reload method (assuming the RA has one)? A pointer to the documentation would be OK. http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-reload Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
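In short, per the linked document the LRM will only attempt a reload if the resource agent advertises a reload action in its metadata and the changed parameters are all marked as non-unique; otherwise it falls back to a full restart. A hedged, schematic metadata fragment for a hypothetical RA parameter:

    <parameter name="options" unique="0">
      <content type="string"/>
    </parameter>
    ...
    <actions>
      <action name="reload" timeout="20s"/>
    </actions>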
Re: [Linux-HA] Pacemaker : how to modify configuration ?
On 11/29/2011 10:01 AM, alain.mou...@bull.net wrote: Too long to explain, but in short: it is for the maintenance team of a cluster to be able to temporarily avoid fencing due to the "Flush delayed" problem when resources relocate, which randomly leads to failed monitoring and therefore fencing ... whereas it is not a valid error, it is a bug (I've opened another thread on this ML about this "Flush delayed" problem). Sounds like you also want to implement ACLs ... to limit maintenance team members to modifying only specific parts of your config. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Alain From: RaSca ra...@miamammausalinux.org To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Cc: alain.mou...@bull.net Date: 29/11/2011 09:35 Subject: Re: [Linux-HA] Pacemaker : how to modify configuration ? On Tue, 29 Nov 2011 09:30:41 CET, alain.mou...@bull.net wrote: Hi, yes I know it is possible this way, but I don't like to tell anybody to use crm configure edit because it is a slightly risky command, with a risk of corrupting the file ... when I'm the person who operates, I often use crm configure edit, but I'm a little reluctant to tell somebody else who is not really a Pacemaker specialist to use this command. So I'd prefer a command with cibadmin/grep/sed as Andrew suggested. Thanks Alain Consider that a bad configuration will not be accepted by the crm editor. In addition it is possible to dump the current configuration before doing any modifications. That said... if you're reluctant to let non-specialist users modify the configuration, then why let them modify delicate parameters such as on-fail?
Re: [Linux-HA] is it good to create order constraint for sbd resource
On 11/29/2011 09:16 PM, Tim Serong wrote: On 11/29/2011 04:28 PM, Muhammad Sharfuddin wrote: On Mon, 2011-11-28 at 21:47 +0100, Tim Serong wrote: On 11/28/2011 06:54 PM, Muhammad Sharfuddin wrote: is it good/required to create order constraint for sbd resource I am using following fencing resource: primitive sbd_stonith stonith:external/sbd \ meta target-role=Started \ op monitor interval=3000 timeout=120 \ op start interval=0 timeout=120 \ op stop interval=0 timeout=120 \ params sbd_device=/dev/disk/by-id/scsi-360080e50002377b802ff4e4bc873 I have following order constraints: order resA-before-resB inf: resA resB symmetrical=true order resB-before-resC inf: resB resC symmetrical=true should I also create another constraint for sbd like: order sbd_stonith-before-resA inf: sbd_stonith resA symmetrical=true please help/suggest. No. The STONITH resource doesn't need to be running in order for your other resources to be operable (hence no need for an order constraint). Regards, Tim true, but if there is an order constraint for the STONITH resource, then it will at least make it sure that no other resource will be start before the STONITH resource. e.g: order sbd_stonith-before-resA inf: sbd_stonith resA symmetrical=true order resA-before-resB inf: resA resB symmetrical=true order resB-before-resC inf: resB resC symmetrical=true Because I stopped all the resources including STONITH resource(and stopping any resource sets the 'target-role=Stopped'), then started all other resources else/except the STONITH resource, so at that time my cluster has no fencing resource available. So, don't stop the STONITH resource :) And if you do it you are on your own ... you want it? you get it! :-) There are a lot of ways for an administrator to lower service downtime ... Side point: if you use crm configure property stop-all-resources=true, this will stop all resources *except* for any STONITH resources. The point being, you do always want them running... So in order to protect the cluster I thought that there should(must) be an order constraint that specifies that no other resource(s) will be start if STONITH resource is stopped/unavailable. Please suggest/recommend You should generally be OK without order constraints on STONITH resources. I don't recall seeing any other systems where people had created these constraints. I should also note that if, say, your STONITH resource is running on node-0 and that node dies, the cluster will start the STONITH resource on node-1, to kill node-0. It's smart enough. Worst case, if your STONITH resource is completely broken, and a node fails and thus can't be killed, the cluster will sit there and log errors to syslog about its inability to kill the misbehaving node. (Question for everyone else: did I miss anything?) IIRC stonith resources are always started first and stopped last anyways ... without extra constraints ... implicitly. Please someone correct me if I'm wrong. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Regards, Tim signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: Stonith SBD not fencing nodes
On 11/28/2011 08:07 PM, Hal Martin wrote: Thank you for the updated link. I have recompiled pacemaker from checkout b9889764 and stonith still fails to shoot nodes. Maybe posting also the logs from sdgxen-3 can help. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now sdgxen-2:/ # crm node fence sdgxen-3 Do you really want to shoot sdgxen-3? y Syslog from sdgxen-2: Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requested Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: determine_online_status: Node sdgxen-3 is unclean Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: stage6: Scheduling Node sdgxen-3 for STONITH Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: LogActions: Leave stonith-sbd(Started sdgxen-2) Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: unpack_graph: Unpacked transition 4: 4 actions in 4 synapses Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1322492480-29) derived from /var/lib/pengine/pe-warn-1278.bz2 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_pseudo_action: Pseudo action 5 fired and confirmed Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_fence_node: Executing reboot fencing operation (8) on sdgxen-3 (timeout=6) Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: initiate_remote_stonith_op: Initiating remote operation reboot for sdgxen-3: 76727be7-eecb-4778-857c-1a9288c63ee6 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: can_fence_host_with_device: stonith-sbd can not fence sdgxen-3: dynamic-list Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: process_pe_message: Transition 4: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-1278.bz2 Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run crm_verify -L to identify issues. Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_query_timeout: Query 76727be7-eecb-4778-857c-1a9288c63ee6 for sdgxen-3 timed outNov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_timeout: Action reboot (76727be7-eecb-4778-857c-1a9288c63ee6) for sdgxen-3 timed outNov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: remote_op_done: Notifing clients of 76727be7-eecb-4778-857c-1a9288c63ee6 (reboot of sdgxen-3 from ee8c34db-0e5d-4227-aa46-0ad8b3f306d1 by (null)): 0, rc=-8Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: stonith_notify_client: Sending st_fence-notification to client 457/67849bf4-1881-48b9-a5e8-ab1f72116a81Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: StonithOp remote-op state=0 st_target=sdgxen-3 st_op=reboot /Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: Stonith operation 2/8:4:0:bd203590-3295-4f31-a720-01760a5394e8: Operation timed out (-8)Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_callback: Stonith of sdgxen-3 failed (-8)... 
aborting transition.Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failedNov 28 15:01:26 sdgxen-2 crmd: [457]: info: update_abort_priority: Abort priority upgraded from 0 to 100Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: update_abort_priority: Abort action done superceeded by restartNov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_notify: Peer sdgxen-3 could not be terminated (reboot) by anyone for sdgxen-2 (ref=76727be7-eecb-4778-857c-1a9288c63ee6): Operation timed outNov 28 15:01:26 sdgxen-2 crmd: [457]: info: run_graph: Nov 28 15:01:26 sdgxen-2 crmd: [457]: notice: run_graph: Transition 4 (Complete=2, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pengine/pe-warn-1278.bz2): StoppedNov 28 15:01:26 sdgxen-2 crmd: [457]: info: te_graph_trigger: Transition 4 is now completeNov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_state_transition: State transition S_TRANSITION_ENGINE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_pe_invoke: Query 81: Requesting the current CIB: S_POLICY_ENGINENov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_pe_invoke_callback: Invoking the PE: query=81, ref=pe_calc-dc-1322492486-30, seq=240, quorate=1Nov 28 15:01:26 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requestedNov 28 15:01:26
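If stonith-ng reports that stonith-sbd cannot fence the peer via its dynamic host list, it is worth checking the shared sbd disk directly on the node running the stonith resource. A hedged example with the sbd command-line tool; the device path is a placeholder for the real sbd_device:

    sbd -d /dev/disk/by-id/<sbd-device> list
    sbd -d /dev/disk/by-id/<sbd-device> message sdgxen-3 test

Every cluster node should own a slot in the list output, and the test message should show up in sdgxen-3's syslog if its sbd watcher is running.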
Re: [Linux-HA] Antw: Re: Stonith SBD not fencing nodes
On 11/29/2011 12:14 AM, Hal Martin wrote: Sorry; they were included in the previous email but it appears it was not properly spaced to be noticeable in the wall of text. Indeed ... already there, sorry for the noise. strange ... where does this timeout come from? I don't see an evidence this fencing request ran for 60sec ... Did you try to provoke a fencing action without using crm shell? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Syslog from sdgxen-3: Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_ais_dispatch: Update relayed from sdgxen-2 Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_trigger_update: Sending flush op to all hosts for: terminate (true) Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_perform_update: Sent update 7: terminate=true Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node sdgxen-2 now has id: 2065306796 Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node 2065306796 is now known as sdgxen-2 Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0 Nov 28 15:01:21 sdgxen-3 sbd: [442]: info: Latency: 1 Nov 28 15:01:22 sdgxen-3 sbd: [442]: info: Latency: 1 Nov 28 15:01:23 sdgxen-3 sbd: [442]: info: Latency: 1 Nov 28 15:01:24 sdgxen-3 sbd: [442]: info: Latency: 1 Nov 28 15:01:25 sdgxen-3 sbd: [442]: info: Latency: 1 Nov 28 15:01:26 sdgxen-3 sbd: [442]: info: Latency: 1 Nov 28 15:01:14 sdgxen-3 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0 Thanks, -Hal On Mon, Nov 28, 2011 at 6:10 PM, Andreas Kurz andr...@hastexo.com wrote: On 11/28/2011 08:07 PM, Hal Martin wrote: Thank you for the updated link. I have recompiled pacemaker from checkout b9889764 and stonith still fails to shoot nodes. Maybe posting also the logs from sdgxen-3 can help. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now sdgxen-2:/ # crm node fence sdgxen-3 Do you really want to shoot sdgxen-3? 
y Syslog from sdgxen-2: Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requested Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: determine_online_status: Node sdgxen-3 is unclean Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: stage6: Scheduling Node sdgxen-3 for STONITH Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: LogActions: Leave stonith-sbd(Started sdgxen-2) Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_state_transition: State transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: unpack_graph: Unpacked transition 4: 4 actions in 4 synapses Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1322492480-29) derived from /var/lib/pengine/pe-warn-1278.bz2 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_pseudo_action: Pseudo action 5 fired and confirmed Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_fence_node: Executing reboot fencing operation (8) on sdgxen-3 (timeout=6) Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: initiate_remote_stonith_op: Initiating remote operation reboot for sdgxen-3: 76727be7-eecb-4778-857c-1a9288c63ee6 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: can_fence_host_with_device: stonith-sbd can not fence sdgxen-3: dynamic-list Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: stonith_command: Processed st_query from sdgxen-2: rc=0 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: process_pe_message: Transition 4: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-1278.bz2 Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run crm_verify -L to identify issues. Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_query_timeout: Query 76727be7-eecb-4778-857c-1a9288c63ee6 for sdgxen-3 timed outNov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_timeout: Action reboot (76727be7-eecb-4778-857c-1a9288c63ee6) for sdgxen-3 timed outNov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: remote_op_done: Notifing clients of 76727be7-eecb-4778-857c-1a9288c63ee6 (reboot of sdgxen-3 from ee8c34db-0e5d-4227-aa46-0ad8b3f306d1 by (null)): 0, rc=-8Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: stonith_notify_client: Sending st_fence-notification to client 457/67849bf4-1881-48b9-a5e8-ab1f72116a81Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: StonithOp remote-op state=0 st_target=sdgxen-3 st_op=reboot /Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: Stonith operation 2/8:4:0:bd203590-3295-4f31-a720-01760a5394e8: Operation timed out (-8)Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_callback: Stonith of sdgxen-3 failed (-8)... aborting
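Provoking a fencing action without the crm shell can be done against stonithd directly; a hedged example using stonith_admin as shipped with Pacemaker 1.1.x:

    stonith_admin --reboot sdgxen-3

If that also times out, the problem is in the stonith-ng/sbd layer rather than in the crm shell's node fence command.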
Re: [Linux-HA] Problem when installing Cluster Glue 1.0.8
On 11/26/2011 11:39 PM, Lazaro Rubén García Martinez wrote: Hello everyone on this list, I am a new member. I am writing because I need to install Heartbeat, but I am new to this software. I am trying to install Cluster Glue on a machine with CentOS 6, but when I execute the make command this error is shown: is there a special reason for you not to use Pacemaker and the rest of the cluster stack that is already shipped with CentOS 6? gmake[2]: Entering directory `/usr/local/src/heartbeat/Reusable-Cluster-Components-glue--7583026c6ace/doc' /usr/bin/xsltproc \ --xinclude \ http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl hb_report.xml error : Operation in progress warning: failed to load external entity http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl; cannot parse http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl gmake[2]: *** [hb_report.8] Error 4 gmake[2]: Leaving directory `/usr/local/src/heartbeat/Reusable-Cluster-Components-glue--7583026c6ace/doc' gmake[1]: *** [all-recursive] Error 1 gmake[1]: Leaving directory `/usr/local/src/heartbeat/Reusable-Cluster-Components-glue--7583026c6ace/doc' make: *** [all-recursive] Error 1 Are all the build requirements installed? ... libxslt, docbook-dtds, docbook-style-xsl, help2man, just to name some. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Can anybody tell me how to solve this problem? Thank you very much for your time. Regards, and sorry for my bad English.
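The xsltproc error indicates the build is trying to fetch the DocBook stylesheet over the network because no local copy is registered in the XML catalogs. Installing the build requirements Andreas lists should make the build use local stylesheets instead; on CentOS 6 that would be roughly:

    yum install libxslt docbook-dtds docbook-style-xsl help2man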
Re: [Linux-HA] The master server of HA system suddently roll over
On 11/22/2011 01:27 AM, tyo...@globalchoice.us wrote: Dear sirs: This is Yang. I set up 2 database servers using heartbeat (heartbeat-2.1.3-3.el5.centos.rpm). They have been running for over 40 days Please consider an update to Corosync/Pacemaker ... very well. But suddenly the master database server rolled over to the slave database server. When I check the master server, everything is good, and I copy the ha-log as follows: heartbeat[16316]: 2011/11/21_10:46:40 info: killing /usr/local/bin/check_service process group 18673 with signal 15 heartbeat[16316]: 2011/11/21_10:46:40 info: killing /usr/lib/heartbeat/mgmtd -v process group 16383 with signal 15 mgmtd[16383]: 2011/11/21_10:46:40 info: mgmtd is shutting down heartbeat[16316]: 2011/11/21_10:46:40 info: killing /usr/lib/heartbeat/crmd process group 16382 with signal 15 crmd[16382]: 2011/11/21_10:46:40 info: crm_shutdown: Requesting shutdown crmd[16382]: 2011/11/21_10:46:40 info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ] This looks like someone simply shut down Heartbeat ... if this is the reason you should find a clear indication in the logs. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now crmd[16382]: 2011/11/21_10:46:40 info: do_state_transition: All 2 cluster nodes are eligible to run resources. crmd[16382]: 2011/11/21_10:46:40 info: do_shutdown_req: Sending shutdown request to DC: kanridb-master crmd[16382]: 2011/11/21_10:46:40 info: do_shutdown_req: Processing shutdown locally crmd[16382]: 2011/11/21_10:46:40 info: handle_shutdown_request: Creating shutdown request for kanridb-master tengine[16585]: 2011/11/21_10:46:40 info: extract_event: Aborting on shutdown attribute for 9134a5e8-6a99-4392-ae5b-06e6a05dd9c0 tengine[16585]: 2011/11/21_10:46:40 info: update_abort_priority: Abort priority upgraded to 100 pengine[16586]: 2011/11/21_10:46:40 info: determine_online_status: Node kanridb-master is shutting down pengine[16586]: 2011/11/21_10:46:40 info: determine_online_status: Node kanridb-slave is online pengine[16586]: 2011/11/21_10:46:40 notice: group_print: Resource Group: group_1 . Is there someone who can tell me why this master server had to roll over, and what went wrong on this server at 2011/11/21_10:46:40? Thanks, Yang
Re: [Linux-HA] [Heartbeat][Pacemaker] VIP doesn't switch to other server
Hello Mathieu, On 11/17/2011 07:22 PM, SEILLIER Mathieu wrote: Hi all, I have to use Heartbeat with Pacemaker for High Availability between 2 Tomcat 5.5 servers under Linux RedHat 5.4. The first server is active, the other one is passive. The master is called servappli01, with IP address 186.20.100.81, the slave is called servappli02, with IP address 186.20.100.82. I configured a virtual IP 186.20.100.83. Each Tomcat is not launched when server is started, this is Heartbeat which starts Tomcat when it's running. All seem to be OK, each server see the other as active, and the crm_mon command shows this below : Last updated: Thu Nov 17 19:03:34 2011 Stack: Heartbeat Current DC: servappli01 (bf8e9a46-8691-4838-82d9-942a13aeedca) - partition with quorum Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87 2 Nodes configured, 2 expected votes 2 Resources configured. Online: [ servappli01 servappli02 ] Clone Set: ClusterIPClone (unique) ClusterIP:0(ocf::heartbeat:IPaddr2): Started servappli01 ClusterIP:1(ocf::heartbeat:IPaddr2): Started servappli02 Your did not only configured a simple VIP but a cluster IP which acts like a simple static loadbalancer ... man iptables ... search for CLUSTERIP. If this was not your intention, simply don't clone it. If you want a clusterip you have to choose correct meta attributes: clone ClusterIPClone ClusterIP \ meta globally-unique=true clone-node-max=2 interleave=true Clone Set: TomcatClone (unique) Tomcat:0 (ocf::heartbeat:tomcat):Started servappli01 Tomcat:1 (ocf::heartbeat:tomcat):Started servappli02 The 2 Tomcat servers as identical, and the same webapps are deployed on each server in order to be able to access webapps on the other server if one is down. By default, requests from clients are processed by the first server because it's the master. My problem is that when I crash the Tomcat on the first server, requests from clients are not redirected to the second server. For a while, requests are not processed, then Heartbeat restarts Tomcat itself and requests are processed again by the first server. Requests are never forwarded to the second Tomcat if the first is down. Default behavior on monitoring errors is a local restart. If you always test from the same IP I would expect your requests to fail while Tomcat is not running on the one node you are redirected ... so if you choose the clusterip_hash sourceip-sourceport your chance should be 50/50 to get redirected ... if you want a real loadbalancer you might want to integrate a service likde ldirectord with realserver checks to remove a non-working service from the loadbalancing. ... use ip addr show or define a label to see your VIP ... Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Here is my configuration : ha.cf file (the same on each server) : crm respawn logfacility local0 logfile /var/log/ha-log debugfile /var/log/ha-debug warntime10 deadtime20 initdead120 keepalive 2 autojoinnone nodeservappli01 nodeservappli02 ucast eth0 186.20.100.81 # ignored by node1 (owner of ip) ucast eth0 186.20.100.82 # ignored by node2 (owner of ip) cib.xml file (the same on each server) : ?xml version=1.0 ? 
cib admin_epoch=0 crm_feature_set=3.0.1 dc-uuid=bf8e9a46-8691-4838-82d9-942a13aeedca epoch=127 have-quorum=1 num_updates=51 validate-with=pacemaker-1.0 configuration crm_config cluster_property_set id=cib-bootstrap-options nvpair id=cib-bootstrap-options-dc-version name=dc-version value=1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87/ nvpair id=cib-bootstrap-options-cluster-infrastructure name=cluster-infrastructure value=Heartbeat/ nvpair id=cib-bootstrap-options-expected-quorum-votes name=expected-quorum-votes value=2/ nvpair id=cib-bootstrap-options-no-quorum-policy name=no-quorum-policy value=ignore/ nvpair id=cib-bootstrap-options-stonith-enabled name=stonith-enabled value=false/ /cluster_property_set /crm_config nodes node id=489a0305-862a-4280-bce5-6defa329df3f type=normal uname=servappli01/ node id=bf8e9a46-8691-4838-82d9-942a13aeedca type=normal uname=servappli02/ /nodes resources clone id=TomcatClone meta_attributes id=TomcatClone-meta_attributes nvpair id=TomcatClone-meta_attributes-globally-unique name=globally-unique value=true/ /meta_attributes primitive class=ocf id=Tomcat provider=heartbeat type=tomcat instance_attributes id=Tomcat-instance_attributes nvpair id=Tomcat-instance_attributes-tomcat_name name=tomcat_name value=TomcatSBNG/ nvpair
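If the intent was a plain failover VIP rather than an iptables CLUSTERIP load-sharing address, the IP should simply not be cloned. A hedged sketch in crm syntax, using the virtual IP from this thread; the netmask and the colocation with a running Tomcat instance are assumptions:

    primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip=186.20.100.83 cidr_netmask=24 \
        op monitor interval=10s
    colocation ip_with_tomcat inf: ClusterIP TomcatClone

With this layout the address lives on exactly one node and moves to the other node when no healthy Tomcat instance is left where it currently runs.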
Re: [Linux-HA] Antw: Re: setting one resource of a group to unmanaged: undesired side-effects
On 11/10/2011 11:48 AM, Ulrich Windl wrote: Andreas Kurz andr...@hastexo.com schrieb am 09.11.2011 um 15:48 in Nachricht 4eba92ad.6050...@hastexo.com: On 11/09/2011 01:22 PM, Ulrich Windl wrote: Hi! I have a question : If I have a resource group with pacemaker (pacemaker-1.1.5-5.9.11.1 on SLES11 SP1) that has several resources, and I set one resource to unmanaged, the group should not be affected, right? I also have a colocation like colocation col_ip2 inf: prm_ip2 cln_foo:Slave When I set prm_ip2 to unmanaged the clone (ms) resource cln_foo did a promote (slave to master on node where prm_ip2 is, and a demote master to slave on another node) action. The idea of the colocation was to have the IP address on the node where the slave instance of cln_foo is running. I hope the setup was fine. Misconfiguration or software bug? Try using the whole group and not only the ip in the colocation constraint ... if the cluster sees no need to bind a resource to a specific node it feels free to relocate resources if scores ... or in this case promotion scores are the same. Though I am not sure if this behavior is intended for unmanaged resources in the case you described ... might be a feature ;-) Generally I suggest to use maintenance mode so you can be sure cluster does really nothing with any resource while you are doing your admin tasks. Good point: Can I activate maintenance mode with the crm shell? AFAIK, Maintenance mode as done by the gui just puts all resources in to unmanaged mode. Did I miss something? crm configure property maintenance-mode=true ... puts really all resources in unmanaged mode regardless of individual is-managed settings. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Reagards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
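The maintenance-mode property mentioned above is toggled from the crm shell like this, and while it is set it overrides any per-resource is-managed settings:

    crm configure property maintenance-mode=true
    # ... perform the administrative work ...
    crm configure property maintenance-mode=false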
Re: [Linux-HA] Antw: Re: Q: colocations for clone resources, transitivity
On 11/09/2011 02:19 PM, Ulrich Windl wrote: Florian Haas flor...@hastexo.com wrote on 09.11.2011 at 14:01 in message 4eba7997.9060...@hastexo.com: On 2011-11-09 13:36, Ulrich Windl wrote: Hi! I tried to co-locate an OCFS clone with a DRBD ms-clone, and I tried to co-locate a CTDB clone with the OCFS clone also. The idea was that the OCFS filesystem is where the DRBD is, and the CTDB is where the filesystem is. So actually that is a transitive colocation like: CTDB -> OCFS -> DRBD I guess CRM can't handle that even if DRBD is to be started before OCFS, and OCFS before CTDB. Syslog has messages like: notice: clone_rsc_colocation_rh: Cannot pair prm_DLM:0 with instance of msc_drbd_r0 notice: clone_rsc_colocation_rh: Cannot pair prm_ctdb:0 with instance of cln_ocfs notice: clone_rsc_colocation_rh: Cannot pair prm_ctdb:1 with instance of cln_ocfs <rsc_colocation id="col_ocfs_on_drbd_r0" rsc="cln_ocfs" score="INFINITY" with-rsc="msc_drbd_r0"/> <rsc_colocation id="col_ctdb_ocfs" rsc="cln_ctdb" score="INFINITY" with-rsc="cln_ocfs"/> It would be easier to maintain if resources did not require a multi-level colocation (like depending CTDB on DRBD and depending OCFS on DRBD). What is the easiest solution? Group them all, then clone the group. That would not work in the general case, because you only need the OCFS framework once, the DRBD framework less often, and the filesystem more often. Once you have two DRBD devices or two OCFS filesystems that won't work well. Use resource sets when groups are too inflexible. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Ulrich
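A hedged sketch of what resource sets could look like here in crm syntax, chaining the whole stack in one colocation and one order statement (names taken from the constraints above). The dependency direction of collocated sets is easy to get backwards, so a configuration like this should be checked with crm_verify or ptest before it is committed:

    colocation col_stack inf: cln_ctdb cln_ocfs msc_drbd_r0:Master
    order o_stack inf: msc_drbd_r0:promote cln_ocfs:start cln_ctdb:start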
Re: [Linux-HA] setting one resource of a group to unmanaged: undesired side-effects
On 11/09/2011 01:22 PM, Ulrich Windl wrote: Hi! I have a question : If I have a resource group with pacemaker (pacemaker-1.1.5-5.9.11.1 on SLES11 SP1) that has several resources, and I set one resource to unmanaged, the group should not be affected, right? I also have a colocation like colocation col_ip2 inf: prm_ip2 cln_foo:Slave When I set prm_ip2 to unmanaged the clone (ms) resource cln_foo did a promote (slave to master on node where prm_ip2 is, and a demote master to slave on another node) action. The idea of the colocation was to have the IP address on the node where the slave instance of cln_foo is running. I hope the setup was fine. Misconfiguration or software bug? Try using the whole group and not only the ip in the colocation constraint ... if the cluster sees no need to bind a resource to a specific node it feels free to relocate resources if scores ... or in this case promotion scores are the same. Though I am not sure if this behavior is intended for unmanaged resources in the case you described ... might be a feature ;-) Generally I suggest to use maintenance mode so you can be sure cluster does really nothing with any resource while you are doing your admin tasks. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
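Colocating the whole group instead of only the IP, as suggested above, would look roughly like this; grp_ip2 is a hypothetical name for the group that contains prm_ip2:

    colocation col_ip2 inf: grp_ip2 cln_foo:Slave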
Re: [Linux-HA] md group take over?
Hello, On 11/01/2011 11:29 PM, Miles Fidelman wrote: I've seen a few references to a resource agent for md group takeover - but can't seem to find the actual agent or any documentation. Do you mean the ocf:heartbeat:ManageRAID RA? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Anybody know if it's real? Where to find more info? Thanks much, Miles Fidelman
Re: [Linux-HA] SANs falling over, don't know why!
netmask to CIDR as: 24 Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times Oct 30 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:4222124721766912 (Function Complete) Oct 30 06:02:52 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 30 06:03:17 iscsi1cl6 last message repeated 24 times Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_rx_start(1849) 1 3b30 -7 Oct 30 06:03:18 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b30 1 2a 4096 Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times Oct 30 06:04:32 iscsi1cl6 last message repeated 42 times Oct 30 06:04:33 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations (1.00us average, 0% utilization) in the last 10min Oct 30 06:04:33 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 30 06:05:04 iscsi1cl6 last message repeated 30 times Oct 30 06:05:41 iscsi1cl6 last message repeated 36 times Oct 30 06:05:42 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:5629499605320192 (Function Complete) Regards, James -Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of James Smith Sent: 30 October 2011 00:25 To: General Linux-HA mailing list Subject: Re: [Linux-HA] SANs falling over, don't know why! Hi, Changed nothing to my knowledge :p These boxes don't currently have fencing enabled. I imagine the reboot is caused by a kernel panic, sysctl is set to reboot on this. There is one big 4TB LUN, used by several VMs on XenServer, each with multiple disks. In my quest to resolve, I have changed iet to use fileio instead of blockio and fiddled with some drbd performance related bits (http://www.drbd.org/users-guide/s-latency-tuning.html). If I'm woken up again tonight with this thing breaking it's going in the bin. I'll probably also ditch ietd and try open-iscsi or iscsi-scst. Monday morning I'll be shifting some load off this cluster also. Regards, James -Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas Kurz Sent: 29 October 2011 22:36 To: linux-ha@lists.linux-ha.org Subject: Re: [Linux-HA] SANs falling over, don't know why! Hello, On 10/29/2011 08:47 PM, James Smith wrote: Hi, All of a sudden, a SAN pair which was running without any problems for six months, now decides to fall over every couple of hours. So what did you change? 
;-) The logs I have to go on are below: Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete) Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete) Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete) Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete) Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete) Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512
Re: [Linux-HA] SANs falling over, don't know why!
Hello, On 10/29/2011 08:47 PM, James Smith wrote: Hi, All of a sudden, a SAN pair which was running without any problems for six months, now decides to fall over every couple of hours. So what did you change? ;-) The logs I have to go on are below: Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1125899927618048 (Function Complete) Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete) Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:281474997486080 (Function Complete) Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:562949974196736 (Function Complete) Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by sid:1407374904328704 (Function Complete) Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! 
[DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512 Concurrent local writes Is there any kind of cluster software using a shared quorum disk or sthg. like that using this lun? Or this lun shared between several VMWare ESX VMs? Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24 After this event, both members of the SAN pair reboot. It is very disruptive, as it's killing the VMs using this SAN, requiring fsck's after failure. The load on the SAN doesn't need to be very high for this happen. They reboot because of a kernel panic, or because of some fencing mechanism? Running the following: CentOS 5 with kernel 2.6.18-274.7.1.el5 IET 1.4.20.2 Pacemaker 1.0.11-1.2.el5 DRBD 8.3.11 Would be interesting to see Pacemaker/DRBD/IET config Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] cib.xml missing on a cluster node
On 10/28/2011 09:21 AM, Alessandra Giovanardi wrote: Hi, I solved the problem by using the hb_gui on the other node of the cluster. In that case the IPaddr resource shows the correct ocf/heartbeat CLASS/PROVIDER (on the other node only heartbeat was reported) and the new RG with the new IP are starting correctly. In effect seems a bug in the hb_gui...someone had the same problem on DEBIAN 2.1.3? this software is nearly four years old! ... do I have to say more? ;-) In your opinion can be a problem affecting the cluster functionality or only the management? I prefered manipulating an offline version of the cib when HB 2.1.3 was in, validation checked and replaced current config if all was ok. My opinion is: consider an update ... or hire someone that can assist you. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thank you Bye Alessandra On 10/28/2011 12:32 AM, Alessandra Giovanardi wrote: On 10/28/2011 12:12 AM, Andreas Kurz wrote: Hello, On 10/27/2011 11:54 PM, Alessandra Giovanardi wrote: Hi, I followed your suggestion and all was ok thanks...both nodes update correclty your configuration and takeover works fine. Anyway, after those operations, we tried to add another RG to our cluster, named group_univdrupal_prod with only one IP and we had some problem. The RG is correctly created but at the start (we disabled stonith first) an error occurs. I paste you the log at the end. Why the resource_univdrupal_IP_1 seems unmanaged? For any reason it was unable to start and then to stop the IP on gicdrupal01 ... Quite strange since if I start this IP on gicdrupal01 with: ifconfig eth1:0 130.186.99.43 netmask 255.255.255.0 up the IP is correctly configured on the interface... So it seems a heartbeat problem The only strange think I observe is that the IP resource still present into the first RG was heartbeat::ocf:IPaddr, while the new is created as: heartbeat:IPaddr These are different resource-agents ... use heartbeat:ocf:IPaddr2 ... 
yes, the one with the 2 at the end ;-) I tried also IPaddr2 without success (also in that case the resource is created without the :ocf: field-- why?, from hb_gui the only choices are IPaddr and IPaddr2 with hearbeat and not ocf/heartbeat as Class/Provider: pengine[27190]: 2011/10/27_22:35:53 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped pengine[27190]: 2011/10/27_22:35:53 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped pengine[27190]: 2011/10/27_22:35:54 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped pengine[27190]: 2011/10/27_22:36:05 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped pengine[27190]: 2011/10/27_22:36:05 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01 FAILED pengine[27190]: 2011/10/27_22:36:06 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01 (unmanaged) FAILED pengine[27190]: 2011/10/27_22:36:14 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01 (unmanaged) FAILED pengine[27190]: 2011/10/27_22:36:18 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped pengine[27190]: 2011/10/27_22:36:21 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01 FAILED pengine[27190]: 2011/10/27_22:36:22 notice: native_print: resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01 (unmanaged) FAILE It sound me like a problem of the hb_gui, which does not compromise the cluster features, but does not permit the RG creation... Some other suggestions to by-pass the problem? even if I select from the hb_GUI the IPaddr heartbeat... Furthermore the heartbeat release version seems the last available into the DEBIAN lenny stable repository: ii heartbeat 2.1.3-6lenny4 Subsystem for High-Availability Linux ii heartbeat-2 2.1.3-6lenny4 Subsystem for High-Availability Linux ii heartbeat-2-gui 2.1.3-6lenny4 Provides a gui interface to manage heartbeat clusters ii heartbeat-gui 2.1.3-6lenny4 really, really, really consider an upgrade ... even if you are on lenny, use latest pacemaker backports packages. Regards, Andreas Is quite complicated for us ;-) Thanks A. signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] cib.xml missing on a cluster node
On 10/28/2011 11:06 AM, Alessandra Giovanardi wrote: On 10/28/2011 10:35 AM, Andreas Kurz wrote: On 10/28/2011 09:21 AM, Alessandra Giovanardi wrote: Hi, I solved the problem by using the hb_gui on the other node of the cluster. In that case the IPaddr resource shows the correct ocf/heartbeat CLASS/PROVIDER (on the other node only heartbeat was reported) and the new RG with the new IP starts correctly. In effect it seems to be a bug in the hb_gui... has someone had the same problem on Debian 2.1.3? this software is nearly four years old! ... do I have to say more? ;-) right, but this cluster should be active only for the next three/four months and then we will migrate all services to a farm (load balanced environment). Furthermore the services are very critical (no more than 5 minutes offline)... so we would like to minimize all changes... yes ... never touch a running system ;-) In your opinion, can it be a problem affecting the cluster functionality or only the management? I preferred manipulating an offline version of the cib when HB 2.1.3 was current, checking that it validated and replacing the running config only if all was ok. How could you validate the new cib (and also the current one)? Do we need another cluster to test it? No need for another cluster to use the tools ... of course it is a good idea to have a test system before deploying a new config to critical systems. Try: cibadmin, crm_verify ... also xmllint might be helpful. My opinion is: consider an update ... or hire someone who can assist you. Correct, but as reported above this cluster should be switched off in a few months. Good luck! Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thank you. A. signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
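For reference, a rough sketch of what such an offline check can look like with those tools -- the file name below is only an illustration, not something taken from this thread:

cibadmin --query > /tmp/cib-test.xml              # dump the live CIB to a working copy
xmllint --noout /tmp/cib-test.xml                 # after editing: check it is still well-formed XML
crm_verify --xml-file /tmp/cib-test.xml -V        # let the cluster tooling validate the content
cibadmin --replace --xml-file /tmp/cib-test.xml   # only if both checks pass, push it back to the running cluster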
Re: [Linux-HA] Standard dlm_controld from cman
hello, On 10/27/2011 03:25 AM, Nick Khamis wrote: Hello Everyone, I just compiled the latest version of cman (cluster 3.1.7) for standard dlm_controld support. My setup is as such (compiled from source, in the order): Glue 1.0.8 RA 3.9 Corosync 1.4.2 OpenAIS 1.1.4 Cluster 3.1.7 Pacemaker 1.1.6 dlm_controld: /usr/local/src/cluster-3.1.7/group/dlm_controld /usr/local/src/cluster-3.1.7/group/dlm_controld/dlm_controld.h /usr/local/src/cluster-3.1.7/group/dlm_controld/dlm_controld But the controld RA still complains: controld[20623]: ERROR: Setup problem: couldn't find command: dlm_controld.pcmk IIRC there is no Pacemaker specific controld in Cluster 3.1.x any more, this version is intended to be universal ... tried symlinking? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Issuing crm_report --features: 1.1.6 - 9971ebba4494012a93c03b40a2c58ec0eb60f50c: ncurses corosync-quorum corosync Do I need to re-install pacemaker for the cman dlm stuff to be included? This is for ocfs2. Thanks in Advance, Nick. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
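If the symlink route is tried, a minimal sketch could be the following; the /usr/local prefix is only a guess based on the source build, so adjust it to wherever dlm_controld actually ended up, and make sure the link lands in a directory that is in the PATH the resource agent uses:

ln -s /usr/local/sbin/dlm_controld /usr/sbin/dlm_controld.pcmk
which dlm_controld.pcmk   # the controld RA should now be able to find the binary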
Re: [Linux-HA] Basic Question about LVM
On 10/25/2011 02:06 AM, Bob Schatz wrote: Andreas, Thanks for your help! Comments below with [BS] From: Andreas Kurz andr...@hastexo.com To: linux-ha@lists.linux-ha.org Sent: Wednesday, October 19, 2011 3:36 AM Subject: Re: [Linux-HA] Basic Question about LVM Hello, On 10/18/2011 11:59 PM, Bob Schatz wrote: I am trying to setup a LVM fail over cluster and I must be missing something basic. :( The configuration I want is: IP address | File System | LVM I am not using DRBD. So you are using a shared storage device like a FC disk? [BS] Yes. We are using shared disk. I am running this on Ubuntu 10.04LS. Everything works fine and I can migrate the group between two nodes. However, if I reboot one node OR STONITH one node it causes the other node to stop and then restart all resources. The problem is that when the node reboots, it activates the LVM volume group and then Pacemaker says native_add_running: Resource ocf::LVM:volume-lvm-p appears to be active on 2 nodes. This causes the group to stop and then be restarted. I tried to play with /etc/lvm/lvm.conf filtering but that just prevented the disks from being read even by the agent. Adopted volume_list Parameter to not activate all vgs? After doing changes to lvm.conf you also need to recreate initramfs. Safest and recommended setup is to use clvmd with dlm and update your vg to a clustered one (also adopt the locking type in lvm.conf). Last step is to integrate dlm/clvmd/lvm in your Pacemaker setup and you are ready to go. [BS] Thanks. I am studying the various docs I have found on setting up clvmd and it looks pretty straightforward. One question I have is do I need to configure the o2cb resource? I do not want a shared file system and therefore I am not going to use OCFS2. It seemed to me that o2cb only makes sense if you have OCFS2. No need for o2cb if you are not using OCFS2. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks again, Bob Regards, Andreas signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
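As a rough illustration of that last step, a crm snippet for the dlm/clvmd/LVM part might look like the following; the resource and volume group names are invented here, and the clvmd agent name varies by distribution (e.g. ocf:lvm2:clvmd on some, ocf:heartbeat:clvm on others):

primitive p_dlm ocf:pacemaker:controld \
        op monitor interval=60s timeout=60s
primitive p_clvmd ocf:lvm2:clvmd \
        op monitor interval=60s timeout=60s
primitive p_lvm ocf:heartbeat:LVM \
        params volgrpname=vg_shared \
        op monitor interval=60s timeout=60s
group g_lvm p_dlm p_clvmd p_lvm
clone cl_lvm g_lvm meta interleave=true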
Re: [Linux-HA] what if brain split happens
On 10/25/2011 07:51 PM, Hai Tao wrote: actually what I saw is that both nodes shut down heartbeat, and then restarted heartbeat I guess you are using Heartbeat v1, without crm but with a haresources config file? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Thanks. Hai Tao Date: Tue, 25 Oct 2011 10:35:53 -0700 From: david.l...@digitalinsight.com To: dmaz...@bmrb.wisc.edu; linux-ha@lists.linux-ha.org Subject: Re: [Linux-HA] what if brain split happens On Tue, 25 Oct 2011, Dimitri Maziuk wrote: On 10/24/2011 10:49 PM, Hai Tao wrote: In case heartbeat communication is lost and split brain has happened, both nodes (a two-node cluster, for a simple example) are holding the vip and other resources. When the heartbeat communication comes back, what will happen? 1. Will both nodes keep the vip and resources forever? 2. Will both nodes realize that split brain has happened, and restart heartbeat? In theory -- #2, except they shouldn't restart heartbeat; one of them should stop the resources. In practice, one of the interesting things that happens when the comms come back is that you have a duplicate ip address (vip) on your network. That's not something you want to happen, so you better make sure one of the nodes is down before you restore the comms. Actually, I believe that what happens is that both nodes stop the resource, and then one of the nodes starts it. This solves the dup-IP problem because starting the resource re-sends the appropriate ARP packets to clean up the network. David Lang ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] [DRBD-user] Questions Regarding Configuration
Hello Nick, On 10/25/2011 02:43 AM, Nick Khamis wrote: Hello Andreas, I did not want to post the following to the list. hmmm ... that was not exactly successful ;-) Thank you so much for your help thus far, it has enabled us to get up and running, and focus on other aspects of the project. Glad to help! I am slowly starting to learn the pcmk concept the hard way! ;) As for: Cluster file system for Asterisk? Are you sure it's worth adding that extra layer? The idea is to cluster asterisk providing both failover and load balancing. We will attempt to do this using an ocf resource agent implemented by hastexo, or using a proxy. For this reason we like the idea of using a network filesystem for the asterisk config files. It's just prototyped right now using virtual machines. So you really want it the extra hard way?! Cluster fs for some config files ... Good luck ;-) Yes of course, we can have an active/active asterisk cluster with each instance managing its own config files, therefore eliminating ocfs/gfs etc. Right now I have to figure out how ocf:pacemaker:controld works, and get myself away from the "not installed" error. We have a working distributed locking mechanism; however, I cannot find any good information on what the resource agent requires, how it works, etc... Ubuntu 10.04 LTS - Lucid - is one of the few distros having packages for all stacks (ocfs2/gfs2), see: http://martinloschwitz.wordpress.com/2011/10/24/updated-linux-cluster-stack-packages-for-ubuntu-10-04/ As you know, the good thing about learning the hard way and making ALL the mistakes is that I will be able to identify and point out the mistakes others make when seeking help from the mailing list. Nice to hear you want to give something back to the community. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Kind Regards, Nick from Toronto. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] PCMK + OCFS2
hello, On 10/25/2011 08:14 PM, Nick Khamis wrote: Hello Everyone, Moving forward, I noticed that there was not much documentation regarding getting the pcmk stack working with ocfs2. I have the configuration up and running; however, I missed the part regarding what is required for pcmk+ocfs support (to get ocf:pacemaker:controld + ocf:pacemaker:o2cb working). yes, there is not that much documentation on that part. Everything is built from source using the latest version of Glue, RA, PCMK, and OpenAIS. OCFS2 works fine manually, and now I am trying to get corosync to handle it. This is on a prototype environment right now using Debian Squeeze; however, we will be using an EL like Red Hat for production. No stonith is required just yet, but if the documentation includes that as well it would be beneficial very soon. When using a cluster fs, fencing is not an option but an obligation! If you already know you want to go with RHEL or some derivative, you can save some miles by not going down the OCFS2 path. I can recommend reading: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html ... you will see this is quite different from running OCFS2 on, let's say, Debian. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
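For what it is worth, the controld/o2cb part usually ends up as a cloned pair of resources; a hedged sketch (names and timeouts are illustrative, and the OCFS2 Filesystem resource would be a further clone ordered after this one):

primitive p_controld ocf:pacemaker:controld \
        op monitor interval=60s timeout=60s
primitive p_o2cb ocf:pacemaker:o2cb \
        op monitor interval=60s timeout=60s
group g_ocfs2base p_controld p_o2cb
clone cl_ocfs2base g_ocfs2base meta interleave=true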
Re: [Linux-HA] Resources didn't failover when failed
Hello, On 10/23/2011 11:30 AM, James Smith wrote: Hi, I was presented with the following status of a two node cluster: [root@iscsi1cl2 primestaff]# crm_mon -fN Attempting connection to the cluster... Last updated: Sat Oct 22 18:10:07 2011 Stack: openais Current DC: iscsi1cl2 - partition with quorum Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 2 Nodes configured, 2 expected votes 2 Resources configured. Online: [ iscsi1cl2 iscsi2cl2 ] Master/Slave Set: iscsidrbdclone Masters: [ iscsi2cl2 ] Slaves: [ iscsi1cl2 ] Resource Group: coregroup ClusterIP (ocf::heartbeat:IPaddr2): Started iscsi2cl2 iscsitarget(ocf::heartbeat:iSCSITarget): Started iscsi2cl2 FAILED iscsilun (ocf::heartbeat:iSCSILogicalUnit): Started iscsi2cl2 (unmanaged) FAILED mail_me(ocf::heartbeat:MailTo):Stopped Migration summary: * Node iscsi2cl2: iscsitarget: migration-threshold=3 fail-count=1 iscsilun: migration-threshold=3 fail-count=100 ClusterIP: migration-threshold=3 fail-count=1 * Node iscsi1cl2: Failed actions: iscsitarget_monitor_1000 (node=iscsi2cl2, call=16, rc=-2, status=Timed Out): unknown exec error iscsilun_stop_0 (node=iscsi2cl2, call=24, rc=-2, status=Timed Out): unknown exec error Error on stop -- no fencing configured. Cluster does not know if resource is still running and blocks. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now In this instance, I am wondering why the resource didn't failover to the awaiting secondary server? :( Config is below: crm(live)configure# show node iscsi1cl2 \ attributes standby=off node iscsi2cl2 \ attributes standby=off primitive ClusterIP ocf:heartbeat:IPaddr2 \ params ip=10.100.0.101 cidr_netmask=255.255.255.0 nic=vlan158 \ op monitor interval=1s \ meta migration-threshold=3 primitive iscsidrbd ocf:linbit:drbd \ params drbd_resource=iscsidisk \ op monitor interval=15s role=Master timeout=30s \ op monitor interval=16s role=Slave timeout=31s \ meta migration-threshold=3 primitive iscsilun ocf:heartbeat:iSCSILogicalUnit \ params implementation=iet lun=2 target_iqn=iqn.2010-05.iscsicl2:LUN02.sanvol path=/dev/drbd0 scsi_id=19101000101cl2iscsi \ op monitor interval=1s timeout=5s \ meta target-role=Started migration-threshold=3 primitive iscsitarget ocf:heartbeat:iSCSITarget \ params implementation=iet iqn=iqn.2010-05.iscsicl2:LUN02.sanvol portals= \ meta target-role=Started migration-threshold=3 \ op monitor interval=1s timeout=5s primitive mail_me ocf:heartbeat:MailTo \ params email=a...@a.com \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s \ op monitor interval=10 timeout=10 depth=0 group coregroup ClusterIP iscsitarget iscsilun mail_me ms iscsidrbdclone iscsidrbd \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true target-role=Started migration-threshold=3 colocation core_group-with-iscsidrbdclone inf: coregroup iscsidrbdclone:Master order iscsidrbdclone-before-core_group inf: iscsidrbdclone:promote iscsitarget:start property $id=cib-bootstrap-options \ dc-version=1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ stonith-enabled=false \ no-quorum-policy=ignore \ last-lrm-refresh=1315228396 rsc_defaults $id=rsc-options \ resource-stickiness=100 Regards, James ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org 
http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
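To illustrate what adding fencing could look like for a two-node setup like this one, a sketch using external/ipmi (the BMC addresses and credentials are placeholders; any other working fencing device would do as well):

primitive st_iscsi1cl2 stonith:external/ipmi \
        params hostname=iscsi1cl2 ipaddr=<bmc-of-iscsi1cl2> userid=<user> passwd=<secret> \
        op monitor interval=60s
primitive st_iscsi2cl2 stonith:external/ipmi \
        params hostname=iscsi2cl2 ipaddr=<bmc-of-iscsi2cl2> userid=<user> passwd=<secret> \
        op monitor interval=60s
location l_st_iscsi1cl2 st_iscsi1cl2 -inf: iscsi1cl2
location l_st_iscsi2cl2 st_iscsi2cl2 -inf: iscsi2cl2
property stonith-enabled=true

With fencing in place, a timed-out stop leads to the node being fenced instead of the resource being left blocked as unmanaged.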
Re: [Linux-HA] [DRBD-user] Ensuring drbd is started before mounting filesystem
On 10/23/2011 11:18 PM, Nick Khamis wrote: Hello Everyone, I was wondering if it's possible to use the order directive to ensure that drbd is fully started before attempting to mount the filesystem? I tried the following: node mydrbd1 \ attributes standby=off node mydrbd2 \ attributes standby=off primitive myIP ocf:heartbeat:IPaddr2 \ op monitor interval=60 timeout=20 \ params ip=192.168.2.5 cidr_netmask=24 \ nic=eth1 broadcast=192.168.2.255 \ lvs_support=true primitive myDRBD ocf:linbit:drbd \ params drbd_resource=r0.res \ op monitor role=Master interval=10 \ op monitor role=Slave interval=30 ms msMyDRBD myDRBD \ meta master-max=1 master-node-max=1 \ clone-max=2 clone-node-max=1 \ notify=true globally-unique=false primitive myFilesystem ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/service fstype=ext3 \ op monitor interval=15 timeout=60 \ meta target-role=Started group MyServices myIP myFilesystem meta target-role=Started order drbdAfterIP \ inf: myIP msMyDRBD order filesystemAfterDRBD \ inf: msMyDRBD:promote myFilesystem:start There is no colocation between the DRBD master and the filesystem location prefer-mysql1 MyServices inf: mydrbd1 location prefer-mysql2 MyServices inf: mydrbd2 ? ... these constraints make no sense ... Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
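A hedged sketch of the missing piece, reusing the resource names from the posted config (the constraint names are made up):

colocation filesystemWithDRBDMaster inf: MyServices msMyDRBD:Master
order servicesAfterDRBD inf: msMyDRBD:promote MyServices:start

The colocation ties the group, and with it the Filesystem, to the node where DRBD is promoted; the order can then reference the whole group instead of only the Filesystem primitive.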
Re: [Linux-HA] [DRBD-user] Ensuring drbd is started before mounting filesystem
Sorry for the noise ... address cut paste error Regards, Andreas On 10/24/2011 12:12 AM, Andreas Kurz wrote: On 10/23/2011 11:18 PM, Nick Khamis wrote: Hello Everyone, I was wondering if it's possible to use the order directive to ensure that drbd is fully started before attempting to mount the filesystem? I tried the following: node mydrbd1 \ attributes standby=off node mydrbd2 \ attributes standby=off primitive myIP ocf:heartbeat:IPaddr2 \ op monitor interval=60 timeout=20 \ params ip=192.168.2.5 cidr_netmask=24 \ nic=eth1 broadcast=192.168.2.255 \ lvs_support=true primitive myDRBD ocf:linbit:drbd \ params drbd_resource=r0.res \ op monitor role=Master interval=10 \ op monitor role=Slave interval=30 ms msMyDRBD myDRBD \ meta master-max=1 master-node-max=1 \ clone-max=2 clone-node-max=1 \ notify=true globally-unique=false primitive myFilesystem ocf:heartbeat:Filesystem \ params device=/dev/drbd0 directory=/service fstype=ext3 \ op monitor interval=15 timeout=60 \ meta target-role=Started group MyServices myIP myFilesystem meta target-role=Started order drbdAfterIP \ inf: myIP msMyDRBD order filesystemAfterDRBD \ inf: msMyDRBD:promote myFilesystem:start There is no colocation between the DRBD master and the filesystem location prefer-mysql1 MyServices inf: mydrbd1 location prefer-mysql2 MyServices inf: mydrbd2 ? ... these constraints make no sense ... Regards, Andreas ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] [DRBD-user] Questions Regarding Configuration
On 10/23/2011 09:39 PM, Nick Khamis wrote: The following works as expected: node mydrbd1 \ attributes standby=off node mydrbd2 \ attributes standby=off primitive myIP ocf:heartbeat:IPaddr2 \ op monitor interval=60 timeout=20 \ params ip=192.168.2.5 cidr_netmask=24 \ nic=eth1 broadcast=192.168.2.255 \ lvs_support=true primitive myDRBD ocf:linbit:drbd \ params drbd_resource=r0.res \ op monitor role=Master interval=10 \ op monitor role=Slave interval=30 ms msMyDRBD myDRBD \ meta master-max=1 master-node-max=1 \ clone-max=2 clone-node-max=1 \ notify=true globally-unique=false group MyServices myIP order drbdAfterIP \ inf: myIP msMyDRBD location prefer-mysql1 MyServices inf: mydrbd1 location prefer-mysql2 MyServices inf: mydrbd2 ?? property $id=cib-bootstrap-options \ no-quorum-policy=ignore \ stonith-enabled=false \ expected-quorum-votes=5 \ dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \ cluster-recheck-interval=0 \ cluster-infrastructure=openais rsc_defaults $id=rsc-options \ resource-stickiness=100 However, when modifying the order entry to: order drbdAfterIP \ inf: myIP:promote msMyDRBD:start DRBD no longer works. And when adding the following colocation: yes, the promote of the IP will never happen as it is a) only configured as primitve and b) IPaddr2 does not support a promote action ... no IP promote, no DRBD start ... colocation drbdOnIP \ inf: MyServices msMyDRBD:Master none of the resources work. tried removing those obscure two location constraints? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Inconsistencies between LRMD and CRM + Problem with MySQL OCF
Hello, On 10/22/2011 04:25 AM, Nick Khamis wrote: Hello Everyone, I have been strugglling with the MySQL OCF. On an unrelated, eyeing the logs I saw that some of the resources are shown as not running however. according to CRM, and checking them manually, resources such as myIP and myFilesystem are running: Online: [ mydrbd1 mydrbd2 ] OFFLINE: [ lb2 lb1 astdrbd1 astdrbd2 ] Resource Group: MyServices myIP (ocf::heartbeat:IPaddr2): Started mydrbd2 myFilesystem (ocf::heartbeat:Filesystem):Started mydrbd2 Master/Slave Set: msMyDRBD [myDRBD] Masters: [ mydrbd2 ] Slaves: [ mydrbd1 ] Failed actions: mysql_monitor_0 (node=mydrbd1, call=2, rc=5, status=complete): not installed mysql_monitor_0 (node=mydrbd2, call=2, rc=5, status=complete): not installed So mysql is not installed on nodes mydrbd1/2 ... probing them leads to above error ... all is fine if mysql should never run there and is therefore not installed. You don't get rid of this errors unless you install mysql there (or some form of dummy). Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing key=9:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=mysql_monitor_0 ) Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:mysql:2: probe Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing key=10:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=myIP_monitor_0 ) Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:myIP:3: probe Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing key=11:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=myFilesystem_monitor_0 ) Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:myFilesystem:4: probe Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing key=12:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=myDRBD:0_monitor_0 ) Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:myDRBD:0:5: probe Oct 21 21:58:10 mydrbd2 crmd: [22525]: info: process_lrm_event: LRM operation mysql_monitor_0 (call=2, rc=5, cib-update=9, confirmed=true) not installed Oct 21 21:58:10 mydrbd2 crmd: [22525]: info: process_lrm_event: LRM operation myIP_monitor_0 (call=3, rc=7, cib-update=10, confirmed=true) not running Oct 21 21:58:11 mydrbd2 crmd: [22525]: info: process_lrm_event: LRM operation myFilesystem_monitor_0 (call=4, rc=7, cib-update=11, confirmed=true) not running These are all expected results of the initial probing (monitor_0) of all resources before the Cluster starts with resource allocation rc=7 if a resource is not running, rc5 if necessary binarys/configs are not installed. All informational stuff, cluster needs to know if there is any resource already started outside of the cluster no need to worry ;-) Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now That is all I could find in regards to the MySQL related errors. Using the latest version of pcmk build from source. Also tried the newest version of MySQL OCF downloaded from GIT. Last updated: Fri Oct 21 22:13:08 2011 Last change: Fri Oct 21 21:54:46 2011 via cibadmin on mydrbd1 Stack: openais Current DC: mydrbd1 - partition WITHOUT quorum Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 6 Nodes configured, 5 expected votes 5 Resources configured. Thanks in Advance, Nick. 
___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Basic Question about LVM
Hello, On 10/18/2011 11:59 PM, Bob Schatz wrote: I am trying to setup a LVM fail over cluster and I must be missing something basic. :( The configuration I want is: IP address | File System | LVM I am not using DRBD. So you are using a shared storage device like a FC disk? I am running this on Ubuntu 10.04LS. Everything works fine and I can migrate the group between two nodes. However, if I reboot one node OR STONITH one node it causes the other node to stop and then restart all resources. The problem is that when the node reboots, it activates the LVM volume group and then Pacemaker says native_add_running: Resource ocf::LVM:volume-lvm-p appears to be active on 2 nodes. This causes the group to stop and then be restarted. I tried to play with /etc/lvm/lvm.conf filtering but that just prevented the disks from being read even by the agent. Adopted volume_list Parameter to not activate all vgs? After doing changes to lvm.conf you also need to recreate initramfs. Safest and recommended setup is to use clvmd with dlm and update your vg to a clustered one (also adopt the locking type in lvm.conf). Last step is to integrate dlm/clvmd/lvm in your Pacemaker setup and you are ready to go. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now What am I missing? Thanks, Bob My configuration is: node cc-vol-6-1 node cc-vol-6-2 primitive ipmilan-cc-vol-6-1 stonith:external/ipmi \ params hostname=cc-vol-6-1 ipaddr=XXX userid=XXX passwd=XXX \ op start interval=0 timeout=60 \ op stop interval=0 timeout=60 \ op monitor interval=60 timeout=60 start-delay=0 primitive ipmilan-cc-vol-6-2 stonith:external/ipmi \ params hostname=cc-vol-6-2 ipaddr=XXX userid=XX passwd=X! \ op start interval=0 timeout=60 \ op stop interval=0 timeout=60 \ op monitor interval=60 timeout=60 start-delay=0 primitive volume-fs-p ocf:heartbeat:Filesystem \ params device=/dev/nova-volumes/nova-volumes-vol directory=/volume-mount fstype=xfs \ op start interval=0 timeout=60 \ op monitor interval=60 timeout=60 OCF_CHECK_LEVEL=20 \ op stop interval=0 timeout=120 primitive volume-iscsit-ip-p ocf:heartbeat:IPaddr2 \ params ip=YY nic=ZZ \ op monitor interval=5s primitive volume-lvm-p ocf:heartbeat:LVM \ params volgrpname=nova-volumes exclusive=true \ op start interval=0 timeout=30 \ op stop interval=0 timeout=30 primitive volume-vol-ip-p ocf:heartbeat:IPaddr2 \ params ip=X1x1x1x1x nic=y1y1y1 \ op monitor interval=5s group volume-fs-ip-iscsi-g volume-lvm-p volume-fs-p volume-iscsit-ip-p volume-vol-ip-p \ meta target-role=Started location loc-ipmilan-cc-vol-6-1 ipmilan-cc-vol-6-1 -inf: cc-vol-6-1 location loc-ipmilan-cc-vol-6-2 ipmilan-cc-vol-6-2 -inf: cc-vol-6-2 property $id=cib-bootstrap-options \ dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ no-quorum-policy=ignore \ stonith-enabled=true rsc_defaults $id=rsc-options \ resource-stickiness=1000 ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
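As an illustration of the volume_list idea, /etc/lvm/lvm.conf might be adjusted roughly as follows; the root VG name is a placeholder, the locking_type line only applies if the clvmd/dlm route is chosen, and the initramfs has to be rebuilt afterwards (e.g. update-initramfs -u on Ubuntu):

activation {
    # only the local root VG is auto-activated at boot;
    # the shared VG stays deactivated until the cluster activates it
    volume_list = [ "vg_root" ]
}
global {
    # 3 = clustered locking via clvmd (only for the clvmd/dlm setup)
    locking_type = 3
}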
Re: [Linux-HA] violate uniqueness for parameter drbd_resource
Hello, On 10/19/2011 04:49 PM, Nick Khamis wrote: Hello Everyone, What we have is a 4 node cluster: 2 running mysql as active/passive, and 2 running our application as active/active: MyDRBD1 and MyDRBD2: Mysql, DRBD (active/passive) ASTDRBD1 and ASTDRBD2: In-house application, DRBD dual primary A snippet of our config looks like this: node mydrbd1 \ attributes standby=off node mydrbd2 \ attributes standby=off node astdrbd1 \ attributes standby=off node astdrbd2 \ attributes standby=off primitive drbd_mysql ocf:linbit:drbd \ params drbd_resource=r0.res \ op monitor role=Master interval=10 \ op monitor role=Slave interval=30 . primitive drbd_asterisk ocf:linbit:drbd \ params drbd_resource=r0.res \ op monitor interval=20 timeout=20 role=Master \ op monitor interval=30 timeout=20 role=Slave ms ms_drbd_asterisk drbd_asterisk \ meta master-max=2 notify=true \ interleave=true group MyServices myIP fs_mysql mysql \ meta target-role=Started group ASTServices astIP asteriskDLM asteriskO2CB fs_asterisk \ meta target-role=Started . I am receiving the following warning: WARNING: Resources drbd_asterisk,drbd_mysql violate uniqueness for parameter drbd_resource: r0.res Now the obvious thing to do is to change the resource name at the DRBD level; however, I assumed that the parameter uniqueness was bound to the primitive? Only one resource per cluster should use this value for this attribute if it is marked globally-unique in the RA meta-information. Do yourself a favour and give the DRBD resources a meaningful name, how about asterisk and mysql ;-) My second quick question is, I'd like to use group + location to single out services on specific nodes; however, when creating clones: clone cloneDLM asteriskDLM meta globally-unique=false interleave=true I am receiving an ERROR: asteriskDLM already in use at ASTServices error. My question is, what are the benefits of using group + location vs. clone + location? Once a resource is in a group it cannot be used for clones/MS any more ... though you can clone a group or make it MS. With the latter I assume we will have a long list of locations (one for each primitive + node)? And with the former we do not have the meta information (globally-unique, and interleave)? I assume you want to manage a cluster filesystem ... so put all the dlm/o2cb/cluster-fs resources in a group and clone it (and use interleave for this clone). Regards, Andreas -- Need help with Pacemaker or DRBD? http://www.hastexo.com/now signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
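Following that suggestion with the names from the snippet above, and assuming asteriskDLM, asteriskO2CB and fs_asterisk are first taken out of the ASTServices group, the result might look roughly like this (constraint names invented):

group g_astfs asteriskDLM asteriskO2CB fs_asterisk
clone cl_astfs g_astfs meta interleave=true
colocation col_astfs_on_drbd inf: cl_astfs ms_drbd_asterisk:Master
order ord_astfs_after_drbd inf: ms_drbd_asterisk:promote cl_astfs:start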
Re: [Linux-HA] IPaddr / ifconfig deprecated
Hello, On 10/19/2011 04:35 PM, alain.mou...@bull.net wrote: Hi Florian, just for information, following my remark last week on mysql option -O deprecated, I also noticed in the script IPaddr the use of ifconfig Please use IPaddr2 RA for current setups ... IPaddr is only here for backwards compatibility and for platforms without ip utility. Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now command which is flagged as deprecated (at least on RH) and this generates lots of useless syslog messages. So I replace all the $IFCONFIG functions in IPaddr script (but only for SYSTYPE=Linux as I'm on RHEL6) , and this seems to work fine. function delete_interface : CMD=ip addr del $ipaddr dev $ifname;; function find_generic_interface : ifname=`ip addr | grep $ipaddr | awk '{print $NF}'` case $ifname in *:*) echo $ifname; return $OCF_SUCCESS ;; *) return $OCF_ERR_GENERIC;; esac function find_free_interface : IFLIST=`ip addr | grep eth1:[0-9] | awk '{print $NF}'` function add_interface : CMD=`ip addr add $ipaddr/$CidrNetmask broadcast $broadcast dev $iface_base label $iface`;; providing the CidrNetmask is retrieved in hex format from the iface_base line, or given in hex format in the pacemaker primitive Just a few suggestions which seems to work ... Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
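For comparison, a primitive using IPaddr2 (all values below are placeholders) needs none of those ifconfig calls, since the agent already uses the ip utility internally:

primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip=10.0.0.10 cidr_netmask=24 nic=eth1 \
        op monitor interval=20s timeout=20s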
Re: [Linux-HA] Problem with corosync and drbd 8.4.0
On 10/18/2011 09:44 AM, SINN Andreas wrote: Hello! Thanks. Now it works on one node, but on the other, I get the following error in the messages: ERROR: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist But the device exists: ls -la /dev/drbd0 brw-rw 1 root disk 147, 0 Oct 18 09:37 /dev/drbd0 and when I start only drbd, it works fine. Sorry, not enough information ... hard to comment on one single log line, the full log is needed. What is the drbd status on both nodes when this error occurs ... cat /proc/drbd? Regards, Andreas -- Need help with Pacemaker? http://www.hastexo.com/now Please help. Thanks Andreas -Original Message- From: linux-ha-boun...@lists.linux-ha.org [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas Kurz Sent: Monday, October 17, 2011 3:22 PM To: linux-ha@lists.linux-ha.org Subject: Re: [Linux-HA] Problem with corosync and drbd 8.4.0 Hello, On 10/17/2011 02:53 PM, SINN Andreas wrote: Hello! I have installed drbd 8.4.0 on RHEL 6 and want to build a cluster with corosync. The drbd runs without any problem. When I configure with crm and want to start, I get the following error in crm_mon: Failed actions: data_monitor_0 (node=cl-sftp-server1, call=2, rc=6, status=complete): not configured data_monitor_0 (node=cl-sftp-server2, call=2, rc=6, status=complete): not configured When I do a crm_verify -LV, I get the following errors: crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error - data_monitor_0 failed with rc=6: Preventing data from re-starting anywhere in the cluster crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error - data_monitor_0 failed with rc=6: Preventing data from re-starting anywhere in the cluster cat /etc/drbd.conf global { usage-count yes; } common { net { protocol C; } } resource data { meta-disk internal; device /dev/drbd0; syncer { verify-alg sha1; } net { allow-two-primaries; ^ Don't enable this if you don't know what you are doing ... } on cl-sftp-server1 { disk /dev/sda3; address 10.100.49.101:7790; } on cl-sftp-server2 { disk /dev/sda3; address 10.100.49.102:7790; } crm configure show: node cl-sftp-server1 node cl-sftp-server2 primitive data ocf:linbit:drbd \ params drbd_resource=data \ op monitor interval=60s \ op monitor interval=10 role=Master \ op monitor interval=30 role=Slave property $id=cib-bootstrap-options \ dc-version=1.0.11-a15ead49e20f047e129882619ed075a65c1ebdfe \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ default-action-timeout=240 \ stonith-enabled=false Can someone help? What is the failure? You need to configure a Master/Slave resource, not only the primitive... e.g: master ms_data data meta notify=true should solve your problem ... and reading through your logs should also reveal this ;-) Regards, Andreas -- Need help with Pacemaker or DRBD? http://www.hastexo.com/now ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Problem with corosync and drbd 8.4.0
Hello, On 10/17/2011 02:53 PM, SINN Andreas wrote: Hello! I have installed drbd 8.4.0 on RHEL 6 and want to build a cluster with corosync. The drbd runs without any problem. When I configure with crm and want to start, I get the following error in crm_mon: Failed actions: data_monitor_0 (node=cl-sftp-server1, call=2, rc=6, status=complete): not configured data_monitor_0 (node=cl-sftp-server2, call=2, rc=6, status=complete): not configured When I do a crm_verify -LV, I get the following errors: crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error - data_monitor_0 failed with rc=6: Preventing data from re-starting anywhere in the cluster crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error - data_monitor_0 failed with rc=6: Preventing data from re-starting anywhere in the cluster cat /etc/drbd.conf global { usage-count yes; } common { net { protocol C; } } resource data { meta-disk internal; device /dev/drbd0; syncer { verify-alg sha1; } net { allow-two-primaries; ^ Don't enable this if you don't know what you are doing ... } on cl-sftp-server1 { disk /dev/sda3; address 10.100.49.101:7790; } on cl-sftp-server2 { disk /dev/sda3; address 10.100.49.102:7790; } crm configure show: node cl-sftp-server1 node cl-sftp-server2 primitive data ocf:linbit:drbd \ params drbd_resource=data \ op monitor interval=60s \ op monitor interval=10 role=Master \ op monitor interval=30 role=Slave property $id=cib-bootstrap-options \ dc-version=1.0.11-a15ead49e20f047e129882619ed075a65c1ebdfe \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ default-action-timeout=240 \ stonith-enabled=false Can someone help? What is the failure? You need to configure a Master/Slave resource, not only the primitive... e.g: master ms_data data meta notify=true should solve your problem ... and reading through your logs should also reveal this ;-) Regards, Andreas -- Need help with Pacemaker or DRBD? http://www.hastexo.com/now ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
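Spelled out with the usual two-node meta attributes (these values are the common defaults for a single-master DRBD pair, not something mandated by this thread), the suggested master/slave wrapper around the existing primitive would look like:

ms ms_data data \
        meta master-max=1 master-node-max=1 \
        clone-max=2 clone-node-max=1 notify=true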
Re: [Linux-HA] Ordered resources
On 2011-05-13 08:55, Maxim Ianoglo wrote: Hello, I have the following configuration: Nodes: Node_A and Node_B Resources: WWW ( gr_apache_www ), NFS Server ( gr_storage_server ), NFS Client ( gr_storage_client ) Locations: gr_apache_www: By Default on Node_A, failover to Node_B gr_storage_server: By Default on Node_A, failover to Node_B gr_storage_client: By Default on Node_A, failovers just in case Node_A was brought back online but gr_storage_server will not be moved to it's default location for now, but gr_apache_www will. Constraints: colocation colo_storage -inf: gr_storage_client gr_storage_server order ord_storage inf: gr_storage_server gr_storage_client order ord_www inf: gr_storage_server ( gr_apache_www_main ) order ord_www2 inf: gr_storage_client ( gr_apache_www_main ) Now I have the situation: I put Node_A in standby so ALL resources should go to Node_B ( except for gr_storage_client ), but for some reason only gr_storage_server is moved to Node_B. gr_apache_www is not even started. How can I make gr_apache_www start even if gr_storage_client is not running anywhere ? But if it is running anywhere it should run after gr_storage_client. order ord_www2 0: gr_storage_client gr_apache_www_main ... an advisory order constraint. Regards, Andreas Configuration ( constraints and cluster options ): location loc_gr_apache_www_default gr_apache_www \ rule $id=prefered_loc_gr_apache_default 100: #uname eq Node_A location loc_gr_apache_www_failover gr_apache_www \ rule $id=prefered_loc_gr_apache_failover 50: #uname eq Node_B location loc_gr_storage_server_default gr_storage_server \ rule $id=prefered_loc_gr_storage_server_default 100: #uname eq Node_A location loc_gr_storage_server_failover gr_storage_server \ rule $id=prefered_loc_gr_storage_server_failover 50: #uname eq Node_B colocation colo_storage -inf: gr_storage_client gr_storage_server order ord_nfslock_storage_client inf: gr_storage_client clone_nfslock order ord_nfslock_storage_server inf: gr_storage_server clone_nfslock order ord_storage inf: gr_storage_server gr_storage_client order ord_www inf: gr_storage_server ( gr_nginx_static gr_apache_www_main ) order ord_www2 inf: gr_storage_client ( gr_nginx_static gr_apache_www_main ) property $id=cib-bootstrap-options \ symmetric-cluster=true \ no-quorum-policy=ignore \ stonith-enabled=false \ stonith-action=reboot \ startup-fencing=true \ stop-orphan-resources=true \ stop-orphan-actions=true \ remove-after-stop=false \ default-action-timeout=60s \ is-managed-default=true \ cluster-delay=60s \ pe-error-series-max=-1 \ pe-warn-series-max=-1 \ pe-input-series-max=-1 \ dc-version=1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87 \ last-lrm-refresh=1305263051 \ cluster-infrastructure=Heartbeat rsc_defaults $id=rsc-options \ resource-stickiness=100 === Thank you. -- Maxim Ianoglo ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems signature.asc Description: OpenPGP digital signature ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
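To make the difference explicit, the two variants of that constraint side by side (same resources as above):

# mandatory: gr_apache_www_main is only allowed to run while gr_storage_client is running
order ord_www2 inf: gr_storage_client gr_apache_www_main
# advisory (score 0): the ordering is honoured when both resources happen to be started,
# but gr_apache_www_main may also run while gr_storage_client is stopped
order ord_www2 0: gr_storage_client gr_apache_www_main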
Re: [Linux-HA] CIB process quits and could not connect to CRM
On 03/10/2011 06:30 AM, Tiruvenkatasamy Baskaran wrote: Hi, I have installed heartbeat-3.0.3-2.el5.x86_64.rpm and pacemaker-1.1.2-7.el6.x86_64.rpm on RHEL. I'm quite sure RHELs version is compiled without Heartbeat support ... try corosync. Regards, Andreas I have configured ha.cf ,authkeys and cib.xml as follows. When I start heartbeart ,it will in turn start the crm as crm is configured in ha.cf file. With this mail ha-log file is attached I could not connect to CRM [root@pcmk-1 crm]# crm configure show Signon to CIB failed: connection failed Init failed, could not perform requested operations ERROR: cannot parse xml: no element found: line 1, column 0 ERROR: No CIB! If I look into log the following message is found Mar 09 17:49:25 pcmk-2 stonith-ng: [5328]: CRIT: get_cluster_type: This installation of Pacemaker does not support the '(null)' cluster infrastructure. Terminating. It was starting cib process Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/cib (495,489) But after some time cib process quits Mar 09 17:49:26 pcmk-2 heartbeat: [5313]: WARN: Managed /usr/lib64/heartbeat/cib process 5326 exited with return code 100. Can anyone tell me why cib process quits and This installation of Pacemaker does not support the '(null)' cluster infrastructure. Terminating. is being displayed For more info look into ha.log file ha.cf logfile /var/log/ha-log logfacility local0 keepalive 2 deadtime 30 initdead 120 udpport 694 bcast eth0# Linux auto_failback on node pcmk-1 pcmk-2 crm respawn authkeys auth 2 2 sha1 test-ha cib.xml cib configuration crm_config cluster_property_set id=cib-bootstrap-options attributes/ /cluster_property_set /crm_config nodes node uname=pcmk-1 type=normal id=f11899c3-ed6e-4e63-abae-b9af90c62283/ node uname=pcmk-2 type=normal id=663bae4d-44a0-407f-ac14-389150407159/ /nodes resources/ constraints/ /configuration /cib ha-log Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: info: Version 2 support: respawn Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: WARN: File /etc/ha.d//haresources exists. Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: WARN: This file is not used because crm is enabled Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: WARN: Logging daemon is disabled --enabling logging daemon is recommended Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: info: ** Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: info: Configuration validated. 
Starting heartbeat 3.0.2 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: heartbeat: version 3.0.2 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: Heartbeat generation: 1299567998 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth0 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: glib: UDP Broadcast heartbeat closed on port 694 interface eth0 - Status: 1 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: G_main_add_TriggerHandler: Added signal manual handler Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: G_main_add_TriggerHandler: Added signal manual handler Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: G_main_add_SignalHandler: Added signal handler for signal 17 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: Local status now set to: 'up' Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: WARN: node pcmk-1: is dead Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Comm_now_up(): updating status to active Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Local status now set to: 'active' Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/ccm (495,489) Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/cib (495,489) Mar 09 17:49:25 pcmk-2 heartbeat: [5325]: info: Starting /usr/lib64/heartbeat/ccm as uid 495 gid 489 (pid 5325) Mar 09 17:49:25 pcmk-2 heartbeat: [5326]: info: Starting /usr/lib64/heartbeat/cib as uid 495 gid 489 (pid 5326) Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/lrmd -r (0,0) Mar 09 17:49:25 pcmk-2 heartbeat: [5327]: info: Starting /usr/lib64/heartbeat/lrmd -r as uid 0 gid 0 (pid 5327) Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/stonithd (0,0) Mar 09 17:49:25 pcmk-2 heartbeat: [5328]: info: Starting /usr/lib64/heartbeat/stonithd as uid 0 gid 0 (pid 5328) Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/attrd (495,489) Mar 09 17:49:25 pcmk-2 heartbeat: [5329]: info: Starting /usr/lib64/heartbeat/attrd as uid 495 gid 489 (pid 5329) Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client /usr/lib64/heartbeat/crmd (495,489) Mar 09