Re: [Linux-HA] pcs configuration issue

2013-12-10 Thread Andreas Kurz
Hi Willi,

On 2013-12-09 07:58, Willi Fehler wrote:
 Hi Chris,
 
 I've upgraded to CentOS-6.5 with the latest version of pcs but the issue
 still exists.
 
 [root@linsrv006 ~]# pcs constraint location mysql rule score=pingd defined pingd
 Error: 'mysql' is not a resource
 
 Do I need to download the latest pcs code and build my own package?

pcs is not mandatory; you can download and use the crm shell for CentOS from:

http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/x86_64/
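A minimal sketch of how that repository can be used on CentOS 6 (the .repo
file name and the crmsh package name are assumptions, so check the directory
listing first):

  cd /etc/yum.repos.d/
  wget http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/RedHat_RHEL-6/network:ha-clustering:Stable.repo
  yum install crmsh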


Regards,
Andreas

 
 Regards - Willi
 
 
 
 On 03.12.13 01:54, Chris Feist wrote:
 On 11/26/2013 03:27 AM, Willi Fehler wrote:
 Hello,

 I'm trying to create the following setup in Pacemaker 1.1.10.

 pcs property set no-quorum-policy=ignore
 pcs property set stonith-enabled=false
 pcs resource create drbd_mysql ocf:linbit:drbd drbd_resource=r0 op monitor interval=60s
 pcs resource master ms_drbd_mysql drbd_mysql master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
 pcs resource create fs_mysql Filesystem device=/dev/drbd/by-res/r0 directory=/var/lib/mysql fstype=xfs options=noatime
 pcs resource create ip_mysql IPaddr2 ip=192.168.0.12 cidr_netmask=32 op monitor interval=20s
 pcs resource create ping ocf:pacemaker:ping host_list=192.168.0.1 multiplier=100 dampen=10s op monitor interval=60s
 pcs resource clone ping ping_clone globally-unique=false
 pcs resource create mysqld ocf:heartbeat:mysql binary=/usr/sbin/mysqld datadir=/var/lib/mysql config=/etc/my.cnf \
 pid=/var/run/mysqld/mysqld.pid socket=/var/run/mysqld/mysqld.sock \
 op monitor interval=15s timeout=30s op start interval=0 timeout=180s op stop interval=0 timeout=300s
 pcs resource group add mysql fs_mysql mysqld ip_mysql
 pcs constraint colocation add mysql ms_drbd_mysql INFINITY with-rsc-role=Master
 pcs constraint order promote ms_drbd_mysql then start mysql
 pcs constraint location mysql rule pingd: defined ping

 There are two issues here: first, there's a bug in pcs that doesn't
 recognize groups in location constraint rules; second, the pcs rule
 syntax is slightly different from crm's.

 You should be able to use this command with the latest upstream:
 pcs constraint location mysql rule score=pingd defined pingd

 Thanks,
 Chris


 The last line is not working:

 [root@linsrv006 ~]# pcs constraint location mysql rule pingd: defined pingd
 Error: 'mysql' is not a resource

 By the way, can anybody verify the other lines? I'm very new to pcs.
 Here is my old configuration.

 crm configure
 crm(live)configure#primitive drbd_mysql ocf:linbit:drbd \
 params drbd_resource=r0 \
 op monitor interval=10s role=Master \
 op monitor interval=20s role=Slave \
 op start interval=0 timeout=240 \
 op stop interval=0 timeout=240
 crm(live)configure#ms ms_drbd_mysql drbd_mysql \
 meta master-max=1 master-node-max=1 \
 clone-max=2 clone-node-max=1 \
 notify=true target-role=Master
 crm(live)configure# primitive fs_mysql ocf:heartbeat:Filesystem \
 params device=/dev/drbd/by-res/r0 directory=/var/lib/mysql fstype=xfs options=noatime \
 op start interval=0 timeout=180s \
 op stop interval=0 timeout=300s \
 op monitor interval=60s
 crm(live)configure# primitive ip_mysql ocf:heartbeat:IPaddr2 \
 params ip=192.168.0.92 cidr_netmask=24 \
 op monitor interval=20
 crm(live)configure# primitive ping_eth0 ocf:pacemaker:ping \
 params host_list=192.168.0.1 multiplier=100 \
 op monitor interval=10s timeout=20s \
 op start interval=0 timeout=90s \
 op stop interval=0 timeout=100s
 crm(live)configure# clone ping_eth0_clone ping_eth0 \
 meta globally-unique=false
 crm(live)configure# primitive mysqld ocf:heartbeat:mysql \
 params binary=/usr/sbin/mysqld datadir=/var/lib/mysql config=/etc/my.cnf \
 pid=/var/run/mysqld/mysqld.pid socket=/var/run/mysqld/mysqld.sock \
 op monitor interval=15s timeout=30s \
 op start interval=0 timeout=180s \
 op stop interval=0 timeout=300s \
 meta target-role=Started
 crm(live)configure# group mysql fs_mysql mysqld ip_mysql
 crm(live)configure# location l_mysql_on_01 mysql 100: linsrv001.willi-net.local
 crm(live)configure# location mysql-on-connected-node mysql \
 rule $id=mysql-on-connected-node-rule -inf: not_defined pingd or pingd lte 0
 crm(live)configure# colocation mysql_on_drbd inf: mysql
 

Re: [Linux-HA] problem with pgsql streaming resource agent

2013-07-08 Thread Andreas Kurz
On 2013-07-08 19:40, Jeff Frost wrote:
 We're testing out the pgsql master slave streaming replication resource agent 
 that's found here:
 
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/pgsql
 
 and using the example 2-node configuration found here 
 https://github.com/t-matsuo/resource-agents/wiki/Resource-Agent-for-PostgreSQL-9.1-streaming-replication
 as a template, we came up with the following configuration:
 
 
 node node1
 node node2
 primitive pgsql ocf:heartbeat:pgsql \
   params pgctl="/usr/pgsql-9.2/bin/pg_ctl" psql="/usr/pgsql-9.2/bin/psql" \
     pgdata="/var/lib/pgsql/9.2/data/" start_opt="-p 5432" rep_mode="async" \
     node_list="node1 node2" repuser="replicauser" \
     restore_command="rsync -aq /var/lib/pgsql/wal_archive/%f %p" \
     master_ip="192.168.253.104" stop_escalate="0" \
   op start interval="0s" role="Master" timeout="60s" on-fail="block"

Looks like you are missing the monitor operations ... as described in
the example you are referring to. In the monitor operation such
master/slave agents recalculate their master score and, in this RA,
refresh various node attributes.

And you should follow the described procedures to correctly start up
the cluster.
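A minimal sketch of the kind of operations that are missing, in the same
crm shell syntax and appended to the pgsql primitive (the intervals and
timeouts here are illustrative, not copied from the wiki page above):

  op monitor interval="4s" timeout="60s" on-fail="restart" \
  op monitor interval="3s" timeout="60s" on-fail="restart" role="Master" \
  op promote interval="0s" timeout="60s" on-fail="restart" \
  op demote  interval="0s" timeout="60s" on-fail="block" \
  op stop    interval="0s" timeout="60s" on-fail="block" \
  op notify  interval="0s" timeout="60s"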

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now





Re: [Linux-HA] Resource Always Tries to Start on the Wrong Node

2013-06-27 Thread Andreas Kurz
Hello Eric,

On 2013-06-27 17:35, Robinson, Eric wrote:
 -Original Message-

 I don't understand why resources try to start on the wrong node (and of
 course fail).

Pacemaker 1.0.7 ... looking at the Changelog of Pacemaker 1.0 at
https://github.com/ClusterLabs/pacemaker-1.0/blob/master/ChangeLog there
are quite a few colocation fixes, also for sets, after that version.

I'd say you hit an already fixed bug.

Regards,
Andreas


 My nodes are ha05 and ha06.

 ha05 is master/primary and all resources are running on it.

 If I run...

 crm resource stop p_MySQL_185

 ..the resource stops fine. Then if I run...

 crm resource start p_MySQL_185

 ..the resource fails to start, and crm_mon shows that it tried to start on 
 the
 wrong node (ha06). Then if I run...

 crm resource cleanup p_MySQL_185

 ..the resource cleans up and starts on node ha05.

 Here is my crm config...

 node ha05.mycharts.md \
 attributes standby=off
 node ha06.mycharts.md
 primitive p_ClusterIP ocf:heartbeat:IPaddr2 \
 params ip=192.168.10.205 cidr_netmask=32 \
 op monitor interval=15s \
 meta target-role=Started
 primitive p_DRBD ocf:linbit:drbd \
 params drbd_resource=ha_mysql \
 op monitor interval=15s
 primitive p_FileSystem ocf:heartbeat:Filesystem \
 params device=/dev/drbd0 directory=/ha_mysql fstype=ext3 options=noatime
 primitive p_MySQL_000 lsb:mysql_000
 primitive p_MySQL_001 lsb:mysql_001
 primitive p_MySQL_054 lsb:mysql_054
 primitive p_MySQL_057 lsb:mysql_057
 primitive p_MySQL_103 lsb:mysql_103
 primitive p_MySQL_106 lsb:mysql_106
 primitive p_MySQL_139 lsb:mysql_139
 primitive p_MySQL_140 lsb:mysql_140
 primitive p_MySQL_141 lsb:mysql_141
 primitive p_MySQL_142 lsb:mysql_142
 primitive p_MySQL_143 lsb:mysql_143
 primitive p_MySQL_144 lsb:mysql_144
 primitive p_MySQL_145 lsb:mysql_145
 primitive p_MySQL_146 lsb:mysql_146
 primitive p_MySQL_147 lsb:mysql_147
 primitive p_MySQL_148 lsb:mysql_148
 primitive p_MySQL_149 lsb:mysql_149
 primitive p_MySQL_150 lsb:mysql_150
 primitive p_MySQL_151 lsb:mysql_151
 primitive p_MySQL_152 lsb:mysql_152
 primitive p_MySQL_153 lsb:mysql_153
 primitive p_MySQL_154 lsb:mysql_154
 primitive p_MySQL_157 lsb:mysql_157
 primitive p_MySQL_158 lsb:mysql_158
 primitive p_MySQL_160 lsb:mysql_160
 primitive p_MySQL_161 lsb:mysql_161
 primitive p_MySQL_162 lsb:mysql_162
 primitive p_MySQL_163 lsb:mysql_163
 primitive p_MySQL_164 lsb:mysql_164
 primitive p_MySQL_165 lsb:mysql_165
 primitive p_MySQL_167 lsb:mysql_167
 primitive p_MySQL_168 lsb:mysql_168
 primitive p_MySQL_169 lsb:mysql_169
 primitive p_MySQL_170 lsb:mysql_170
 primitive p_MySQL_171 lsb:mysql_171
 primitive p_MySQL_172 lsb:mysql_172
 primitive p_MySQL_173 lsb:mysql_173
 primitive p_MySQL_174 lsb:mysql_174
 primitive p_MySQL_175 lsb:mysql_175
 primitive p_MySQL_176 lsb:mysql_176
 primitive p_MySQL_177 lsb:mysql_177
 primitive p_MySQL_178 lsb:mysql_178
 primitive p_MySQL_179 lsb:mysql_179
 primitive p_MySQL_180 lsb:mysql_180
 primitive p_MySQL_181 lsb:mysql_181
 primitive p_MySQL_182 lsb:mysql_182
 primitive p_MySQL_183 lsb:mysql_183
 primitive p_MySQL_184 lsb:mysql_184
 primitive p_MySQL_185 lsb:mysql_185 \
 meta target-role=Started
 primitive p_MySQL_186 lsb:mysql_186
 primitive p_MySQL_187 lsb:mysql_187
 primitive p_MySQL_188 lsb:mysql_188
 primitive p_MySQL_189 lsb:mysql_189 \
 meta target-role=Started
 primitive p_MySQL_191 lsb:mysql_191
 primitive p_MySQL_192 lsb:mysql_192
 primitive p_MySQL_194 lsb:mysql_194
 primitive p_MySQL_195 lsb:mysql_195
 primitive p_MySQL_196 lsb:mysql_196
 primitive p_MySQL_197 lsb:mysql_197
 primitive p_MySQL_198 lsb:mysql_198
 primitive p_MySQL_199 lsb:mysql_199
 primitive p_MySQL_200 lsb:mysql_200
 primitive p_MySQL_201 lsb:mysql_201
 primitive p_MySQL_202 lsb:mysql_202
 primitive p_MySQL_203 lsb:mysql_203
 primitive p_MySQL_204 lsb:mysql_204
 primitive p_MySQL_205 lsb:mysql_205
 primitive p_MySQL_206 lsb:mysql_206
 primitive p_MySQL_207 lsb:mysql_207
 primitive p_MySQL_208 lsb:mysql_208
 primitive p_MySQL_209 lsb:mysql_209
 primitive p_MySQL_210 lsb:mysql_210
 primitive p_MySQL_211 lsb:mysql_211
 primitive p_MySQL_212 lsb:mysql_212
 primitive p_MySQL_213 lsb:mysql_213
 primitive p_MySQL_214 lsb:mysql_214
 primitive p_MySQL_215 lsb:mysql_215
 primitive p_MySQL_216 lsb:mysql_216
 primitive p_MySQL_217 lsb:mysql_217
 primitive p_MySQL_218 lsb:mysql_218
 primitive p_MySQL_219 lsb:mysql_219
 primitive p_MySQL_220 lsb:mysql_220
 primitive p_MySQL_221 lsb:mysql_221
 primitive p_MySQL_222 lsb:mysql_222
 primitive p_MySQL_224 lsb:mysql_224
 primitive p_MySQL_225 lsb:mysql_225
 primitive p_MySQL_226 lsb:mysql_226
 primitive p_MySQL_227 lsb:mysql_227
 primitive p_MySQL_228 lsb:mysql_228
 primitive p_MySQL_229 lsb:mysql_229
 ms ms_DRBD p_DRBD \
 meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
 colocation c_virtdb03 inf: ( p_MySQL_001 p_MySQL_139 p_MySQL_140
 p_MySQL_141 p_MySQL_142 p_MySQL_143 

Re: [Linux-HA] Fwd: stonith with sbd not working

2013-04-10 Thread Andreas Kurz
On 2013-04-10 17:47, Fredrik Hudner wrote:
 Hi Lars,
 I wouldn't mind trying to install one of the tarballs from
 http://hg.linux-ha.org/sbd, only I'm not sure how to do it after I've
 unpacked it. I saw someone from a discussion group who wanted to do the
 same..
 If you could just tell me how to get sbd-1837fd8cc64a installed (I will
 upgrade pacemaker to 1.1.8), that would be great.
 
 I noticed after I checked a bit closer that all those sbd-common.c,
 sbd.h etc. files were missing.. (so yes, it was a packaging issue as such).
 And if it won't work.. what options do I have in a VMware environment? Any
 suggestions?

ESX? ... on RHEL/CentOS the suggestion would be fence_vmware_soap
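For illustration only, a hypothetical crm shell sketch of such a fencing
resource; every value is a placeholder and the parameter names should be
checked against the agent metadata (fence_vmware_soap -o metadata) first:

  primitive st_vmware stonith:fence_vmware_soap \
          params ipaddr="vcenter.example.com" login="fenceuser" passwd="secret" \
                 ssl="1" pcmk_host_map="node1:vm-node1;node2:vm-node2" \
          op monitor interval=600s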

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 krgds
 /Fredrik
 
 
 On Wed, Apr 10, 2013 at 5:34 PM, Lars Marowsky-Bree l...@suse.com wrote:
 
 On 2013-04-10T15:25:38, Fredrik Hudner fredrik.hud...@gmail.com wrote:

 and removed watchdog from the system but without success..
 Still can't see any references that sbd has started in the messages log

 It looks as if the init script of pacemaker (openais/corosync on SUSE)
 is not taking care to start the sbd daemon.

 I don't think RHT will support sbd on RHEL anyway. I don't think that
 has been tested, sorry :-( This looks like a packaging issue.


 Regards,
 Lars

 --
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
 Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde

 






Re: [Linux-HA] Antw: Re: manage/umanage

2013-03-27 Thread Andreas Kurz
On 2013-03-27 08:30, Moullé Alain wrote:
 Hi
 Thanks, but I never asked to run monitoring on an unmanaged resource ...?!
 I'm asking for the opposite: a way to set one resource into a state close to
 unmanaged, meaning unmanaged and without monitoring, and without being
 forced to set the whole cluster unmanaged with maintenance-mode=true.
 
 I think that, judging from the responses, this function does not
 exist ...

You can set a resource to unmanaged and additionally add enabled=false
to the monitor definition.
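For example, a minimal crm shell sketch with a hypothetical Dummy resource,
showing both settings in one primitive definition:

  primitive my_resource ocf:heartbeat:Dummy \
          meta is-managed=false \
          op monitor interval=30s enabled=false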

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 Alain
 On 27/03/2013 08:24, Ulrich Windl wrote:
 Hi!

 I see little sense in running monitoring on an unmanaged resource, specifically
 as some _monitoring_ operations are not strictly read-only, but do change the
 state of a resource (which may be quite unexpected). One example is the RAID RA,
 which tries to re-add missing devices.


 Regards,
 Ulrich

 Moullé Alain alain.mou...@bull.net wrote on 27.03.2013 at 07:56 in message
 51529820.7050...@bull.net:
 Hi
 OK thanks, but sorry, it was not quite the response I was expecting, as I
 already know all that about cleanup, reprobe, etc.  So, more clearly, my
 question was: is there a way via crm to disable the monitoring temporarily
 for one specific resource?

 Thanks
 Alain
 Hi,

 On Mon, 25 Mar 2013 16:25:54 +0100 Moullé Alain alain.mou...@bull.net
 wrote:
 I've tested two things :

 1/ if we set maintenance-mode=true :

    all the configured resources become 'unmanaged', as displayed
  with crm_mon
    start/stop commands are no longer accepted
    and it seems that resources are no longer monitored by pacemaker
  Probably maintenance-mode also tells the cluster manager to completely
  stop monitoring.

 2/ if we target only one resource via crm resource unmanage
 resname :

    it is also displayed as unmanaged with crm_mon
    start/stop commands are no longer accepted
    BUT pacemaker still monitors the resource

 Is there a reason for this difference?
 It's un-managed, not un-monitored ;-)
 Actually this is not a problem, it will monitor as long as the service
 is up. As soon as the first monitor action fails, the resource is marked as
 failed and no more monitor actions are run, until you explicitly ask
 for it with cleanup resource or reprobe node.

 Have fun,

 Arnold



   
   
 







Re: [Linux-HA] Many Resources Dependent on One Resource Group

2013-03-25 Thread Andreas Kurz
On 2013-03-24 17:58, Robinson, Eric wrote:
 In the simplest terms, we currently have resources:
 
 A = drbd
 B = filesystem
 C = cluster IP
 D thru J = mysql instances.
 
 Resource group G1 consists of resources B through J, in that order, and is 
 dependent on resource A.
 
 This fails over fine, but it has the serious disadvantage that if you stop or 
 remove a mysql resource in the middle of the list, all of the ones after it 
 stop too. For example, if you stop G, then H thru J stop as well.
 
 We want to change it so that the resource group G1 consists only of resources 
 B & C. All of the mysql instances (D thru J) are individually dependent on 
 group G1, but not dependent on each other. That way you can stop or remove a 
 mysql resource without affecting the others.
 
 I saw this scenario described in the Pacemaker docs, but I cannot find an 
 example of the syntax.

You can use two resource sets and go without groups, with this crm shell
syntax:

order o_drbd-filesystem-ip-dbs inf: A:promote B C (D E F G H I J)
colocation co_all-follow-drbd inf: (D E F G H I J) B C A:Master

Regards,
Andreas

 
 --
 Eric Robinson
 
 
 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now



Re: [Linux-HA] Fwd: Problem promoting slave to master

2013-03-20 Thread Andreas Kurz
On 2013-03-20 13:30, Fredrik Hudner wrote:
 I presume you are correct about that. (see drbdadm-dump.txt)
 
 fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
 after-resync-target "/usr/lib/drbd/unsnapshot-resync-target-lvm.sh";

after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";

... to remove the constraint once the secondary is in sync again after a
resync run.
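A sketch of how the relevant handlers section in drbd.conf /
global_common.conf could look (if the LVM unsnapshot handler is still
needed, both scripts can be chained in one after-resync-target line):

  handlers {
          fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
  disk {
          fencing resource-only;
  }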

Regards,
Andreas

 
 What would I need to do to overwrite it?
 Or if you have a nicer way to do it.. It's not easy to always take over
 someone else's configuration.
 
 Kind regards
 /Fredrik
 
 On Tue, Mar 19, 2013 at 11:32 PM, Andreas Kurz andr...@hastexo.com wrote:
 
 On 2013-03-19 16:02, Fredrik Hudner wrote:
 Just wanted to change what document it*s been built from.. It should be
 LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6

 There is again that fencing constraint in your configuration ... what
 does drbdadm dump all look like? Any chance you only specified a
 fence-peer handler in your resource configuration but didn't overwrite
 that after-resync-target handler you specified in your
 global_common.conf ... that would explain the dangling constraint that
 will prevent a failover.

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 -- Forwarded message --
 From: Fredrik Hudner fredrik.hud...@gmail.com
 Date: Mon, Mar 18, 2013 at 11:06 AM
 Subject: Re: [Linux-HA] Problem promoting slave to master
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org




 On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz andr...@hastexo.com
 wrote:

 On 2013-03-14 15:52, Fredrik Hudner wrote:
 I set no-quorum-policy to ignore and removed the constraint you
 mentioned.
 It then managed to failover once to the slave node, but I still have
 those.

 Failed actions:

  p_exportfs_root:0_monitor_

 3 (node=testclu01, call=12, rc=7,
   status=complete): not running

  p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
   status=complete): not running

 This only tells you that monitoring of these resources found them once
 not running ... the logs should tell you what & when that happens


 I have attached the logs from master and slave.. I can see that it stops,
 but not really why (my knowledge is too limited to read the logs)



 I then stopped the new master node to see if it fell over to the other
 node with no success.. It remains slave.

 Hard to say without seeing current cluster state like a crm_mon -1frA,
 cat /proc/drbd and some logs ... not enough information ...

 I have attached the output from crm_mon, the new crm configure and
 /proc/drbd


 I also noticed that the constraint
 drbd-fence-by-handler-nfs-ms_drbd_nfs
 was back in the crm configure. Seems like cib makes a replace

 This constraint is added by the DRBD primary if it loses connection to
 its peer and is perfectly fine if you stopped one node.

 Seems like the cluster has a problem attaching to the cluster node IP,
 but I'm not sure why

 I would like to add that I took over this configuration from a guy who
 has left, but I know that it's configured using the technical
 documentation from LINBIT Highly available NFS storage with DRBD and
 Pacemaker.


 Mar 14 15:06:18 [1786] tdtestclu02   crmd: info:
 abort_transition_graph:te_update_diff:126 - Triggered
 transition
 abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) :
 Non-status
 change
 Mar 14 15:06:18 [1786] tdtestclu02   crmd:   notice:
 do_state_transition:   State transition S_IDLE - S_POLICY_ENGINE [
 input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
 Mar 14 15:06:18 [1781] tdtestclu02cib: info:
 cib_replace_notify:Replaced: 0.780.39 - 0.781.1 from tdtestclu01

 So not sure how to remove that constraint on a permanent basis.. it's
 gone
 as long as I don't stop pacemaker.

 Once the DRBD resync is finished it will be removed from the cluster
 configuration again automatically... you typically never need to remove
 such drbd-fence constraints manually, only in some rare failure
 scenarios.

 Regards,
 Andreas



 But it used to work both with no-quorum-policy=freeze and that
 constraint

 Kind regards
 /Fredrik



 On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz andr...@hastexo.com
 wrote:

 On 2013-03-14 13:30, Fredrik Hudner wrote:
 Hi all,

 I have a problem after I removed a node with the force command from
 my
 crm
 config.

 Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6,
 pacemaker 1.1.7-6.el6)



 Then I wanted to add a third node acting as quorum node, but was not
 able
 to get it to work (probably because I don’t understand how to set it
 up).

 So I removed the 3rd node, but had to use the force command as crm
 complained when I tried to remove it.



 Now when I start up Pacemaker the resources don't look like they come up
 correctly



 Online: [ testclu01 testclu02 ]



 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]

  Masters: [ testclu01

Re: [Linux-HA] Fwd: Problem promoting slave to master

2013-03-19 Thread Andreas Kurz
On 2013-03-19 16:02, Fredrik Hudner wrote:
 Just wanted to change what document it*s been built from.. It should be
 LINBIT DRBD 8.4 Configuration Guide: NFS on RHEL 6

There is again that fencing constraint in your configuration ... what
does drbdadm dump all look like? Any chance you only specified a
fence-peer handler in your resource configuration but didn't overwrite
that after-resync-target handler you specified in your
global_common.conf ... that would explain the dangling constraint that
will prevent a failover.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 -- Forwarded message --
 From: Fredrik Hudner fredrik.hud...@gmail.com
 Date: Mon, Mar 18, 2013 at 11:06 AM
 Subject: Re: [Linux-HA] Problem promoting slave to master
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 
 
 
 
 On Fri, Mar 15, 2013 at 1:04 AM, Andreas Kurz andr...@hastexo.com wrote:
 
 On 2013-03-14 15:52, Fredrik Hudner wrote:
 I set no-quorum-policy to ignore and removed the constraint you
 mentioned.
 It then managed to failover once to the slave node, but I still have
 those.

 Failed actions:

  p_exportfs_root:0_monitor_

 3 (node=testclu01, call=12, rc=7,
   status=complete): not running

  p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
   status=complete): not running

 This only tells you that monitoring of these resources found them once
 not running ... the logs should tell you what & when that happens

 
 I have attached the logs from master and slave.. I can see that it stops,
 but not really why (my knowledge is too limited to read the logs)
 


 I then stopped the new master node to see if it fell over to the other node
 with no success.. It remains slave.

 Hard to say without seeing current cluster state like a crm_mon -1frA,
 cat /proc/drbd and some logs ... not enough information ...

 I have attached the output from crm_mon, the new crm configure and
 /proc/drbd
 
 
 I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs
 was back in the crm configure. Seems like cib makes a replace

 This constraint is added by the DRBD primary if it loses connection to
 its peer and is perfectly fine if you stopped one node.

 Seems like the cluster has a problem attaching to the cluster node IP,
 but I'm not sure why
 
 I would like to add that I took over this configuration from a guy who
 has left, but I know that it's configured using the technical
 documentation from LINBIT Highly available NFS storage with DRBD and
 Pacemaker.
 

 Mar 14 15:06:18 [1786] tdtestclu02   crmd: info:
 abort_transition_graph:te_update_diff:126 - Triggered transition
 abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) :
 Non-status
 change
 Mar 14 15:06:18 [1786] tdtestclu02   crmd:   notice:
 do_state_transition:   State transition S_IDLE - S_POLICY_ENGINE [
 input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
 Mar 14 15:06:18 [1781] tdtestclu02cib: info:
 cib_replace_notify:Replaced: 0.780.39 - 0.781.1 from tdtestclu01

 So not sure how to remove that constraint on a permanent basis.. it's
 gone
 as long as I don't stop pacemaker.

 Once the DRBD resync is finished it will be removed from the cluster
 configuration again automatically... you typically never need to remove
 such drbd-fence constraints manually, only in some rare failure scenarios.

 Regards,
 Andreas



 But it used to work both with no-quorum-policy=freeze and that
 constraint

 Kind regards
 /Fredrik



 On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz andr...@hastexo.com
 wrote:

 On 2013-03-14 13:30, Fredrik Hudner wrote:
 Hi all,

 I have a problem after I removed a node with the force command from my
 crm
 config.

 Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6,
 pacemaker 1.1.7-6.el6)



 Then I wanted to add a third node acting as quorum node, but was not
 able
 to get it to work (probably because I don’t understand how to set it
 up).

 So I removed the 3rd node, but had to use the force command as crm
 complained when I tried to remove it.



 Now when I start up Pacemaker the resources don't look like they come up
 correctly



 Online: [ testclu01 testclu02 ]



 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]

  Masters: [ testclu01 ]

  Slaves: [ testclu02 ]

 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]

  Started: [ tdtestclu01 tdtestclu02 ]

 Resource Group: g_nfs

  p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01

  p_fs_shared(ocf::heartbeat:Filesystem):Started
 testclu01

  p_fs_shared2   (ocf::heartbeat:Filesystem):Started
 testclu01

  p_ip_nfs   (ocf::heartbeat:IPaddr2):   Started testclu01

 Clone Set: cl_exportfs_root [p_exportfs_root]

  Started: [ testclu01 testclu02 ]



 Failed actions:

 p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7,
 status=complete): not running

Re: [Linux-HA] Need some help for corosync/pacemaker with slapd master/slave

2013-03-19 Thread Andreas Kurz
On 2013-03-18 16:53, guilla...@cheramy.name wrote:
 Hello,
 
   For a week I have been trying to build a cluster with 2 servers for slapd HA.
 ldap01 is the master slapd server, ldap02 is a replica server. All is OK
 with Debian and the /etc/ldap/slapd.conf configuration.
 
   So now I want ldap02 to become master if ldap01 fails, so we can continue
 to add data to the LDAP database. And if ldap01 comes back, I want it to
 become a slave of ldap02 to resync the data.
 
   So I tried this pacemaker multi-state configuration:
 
 # crm configure show
 node ldap01 \
   attributes standby=off
 node ldap02 \
   attributes standby=off
 primitive LDAP-SERVER ocf:custom:slapd-ms \
   op monitor interval=30s
 primitive VIP-1 ocf:heartbeat:IPaddr2 \
   params ip=10.0.2.30 broadcast=10.0.2.255 nic=eth0
 cidr_netmask=24 iflabel=VIP1 \
   op monitor interval=30s timeout=20s
 ms MS-LDAP-SERVER LDAP-SERVER \
   params config=/etc/ldap/slapd.conf lsb_script=/etc/init.d/slapd \
   meta clone-max=2 clone-node-max=1 master-max=1
 master-node-max=1 notify=false target-role=Master
 colocation LDAP-WITH-IP inf: VIP-1 MS-LDAP-SERVER
 order LDAP-AFTER-IP inf: VIP-1 MS-LDAP-SERVER
 property $id=cib-bootstrap-options \
   dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
   cluster-infrastructure=openais \
   expected-quorum-votes=2 \
   stonith-enabled=false \
   no-quorum-policy=ignore \
   last-lrm-refresh=1363619724
 
 I use an ocf slapd-ms script provided by this article:
 http://foaa.de/old-blog/2010/10/intro-to-pacemaker-part-2-advanced-topics/trackback/index.html#master-slave-primus-inter-pares
 

Interesting setup and resource agent ... you should really go with the
well-tested slapd resource agent that comes with the resource-agents
package, use slapd as a clone resource, and set up slapd to run in
mirror mode.
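A minimal crm shell sketch of that approach (parameter names from memory,
so verify them with crm ra info ocf:heartbeat:slapd):

  primitive p_slapd ocf:heartbeat:slapd \
          params config="/etc/ldap/slapd.conf" services="ldap:///" \
          op monitor interval="30s" timeout="20s"
  clone cl_slapd p_slapd \
          meta clone-max="2" clone-node-max="1"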

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 So my problem is that both nodes start as Slaves for
 MS-LDAP-SERVER :
 
 Last updated: Mon Mar 18 16:53:05 2013
 Last change: Mon Mar 18 16:53:01 2013 via crmd on ldap02
 Stack: openais
 Current DC: ldap01 - partition with quorum
 Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
 2 Nodes configured, 2 expected votes
 3 Resources configured.
 
 
 Online: [ ldap01 ldap02 ]
 
 VIP-1   (ocf::heartbeat:IPaddr2):   Started ldap01
  Master/Slave Set: MS-LDAP-SERVER [LDAP-SERVER]
  Slaves: [ ldap01 ldap02 ]
 
 Could someone help me?
 
 Thanks
 





Re: [Linux-HA] Problem promoting slave to master

2013-03-14 Thread Andreas Kurz
On 2013-03-14 13:30, Fredrik Hudner wrote:
 Hi all,
 
 I have a problem after I removed a node with the force command from my crm
 config.
 
 Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6,
 pacemaker 1.1.7-6.el6)
 
 
 
 Then I wanted to add a third node acting as quorum node, but was not able
 to get it to work (probably because I don’t understand how to set it up).
 
 So I removed the 3rd node, but had to use the force command as crm
 complained when I tried to remove it.
 
 
 
 Now when I start up Pacemaker the resources don't look like they come up
 correctly
 
 
 
 Online: [ testclu01 testclu02 ]
 
 
 
 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]
 
  Masters: [ testclu01 ]
 
  Slaves: [ testclu02 ]
 
 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]
 
  Started: [ tdtestclu01 tdtestclu02 ]
 
 Resource Group: g_nfs
 
  p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01
 
  p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01
 
  p_fs_shared2   (ocf::heartbeat:Filesystem):Started testclu01
 
  p_ip_nfs   (ocf::heartbeat:IPaddr2):   Started testclu01
 
 Clone Set: cl_exportfs_root [p_exportfs_root]
 
  Started: [ testclu01 testclu02 ]
 
 
 
 Failed actions:
 
 p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7,
 status=complete): not running
 
 p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
 status=complete): not running
 
 
 
 The filesystems mount correctly on the master at this stage and can be
 written to.
 
 When I stop the services on the master node for it to fail over, it doesn't
 work.. It loses cluster IP connectivity

Fix your no-quorum-policy; you want to ignore quorum in a two-node
cluster to allow failover ... and if your drbd device is already in
sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs constraint.
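In crm shell terms, a sketch of those two steps (the constraint id is the
one shown in this thread; double-check it with crm configure show before
deleting):

  crm configure property no-quorum-policy=ignore
  crm configure delete drbd-fence-by-handler-nfs-ms_drbd_nfs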

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 
 Corosync.log from master after I stopped pacemaker on master node :  see
 attached file
 
 
 
 Additional files (attached): crm-configure show
 
   Corosync.conf
 
   Global_common.conf
 
 
 
 
 
 I’m not sure how to proceed to get it up in a fair state now
 
 So if anyone could help me it would be much appreciated
 
 
 
 Kind regards
 
 /Fredrik
 
 
 
 




Re: [Linux-HA] Problem promoting slave to master

2013-03-14 Thread Andreas Kurz
On 2013-03-14 15:52, Fredrik Hudner wrote:
 I set no-quorum-policy to ignore and removed the constraint you mentioned.
 It then managed to failover once to the slave node, but I still have those.
 
 Failed actions:
 
  p_exportfs_root:0_monitor_

 3 (node=testclu01, call=12, rc=7,
   status=complete): not running

  p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
   status=complete): not running

This only tells you that monitoring of these resources found them once
not running ... the logs should tell you what & when that happens

 
 I then stopped the new master node to see if it fell over to the other node
 with no success.. It remains slave.

Hard to say without seeing current cluster state like a crm_mon -1frA,
cat /proc/drbd and some logs ... not enough information ...

 I also noticed that the constraint drbd-fence-by-handler-nfs-ms_drbd_nfs
 was back in the crm configure. Seems like cib makes a replace

This constraint is added by the DRBD primary if it loses connection to
its peer and is perfectly fine if you stopped one node.


 Mar 14 15:06:18 [1786] tdtestclu02   crmd: info:
 abort_transition_graph:te_update_diff:126 - Triggered transition
 abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.781.1) : Non-status
 change
 Mar 14 15:06:18 [1786] tdtestclu02   crmd:   notice:
 do_state_transition:   State transition S_IDLE - S_POLICY_ENGINE [
 input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
 Mar 14 15:06:18 [1781] tdtestclu02cib: info:
 cib_replace_notify:Replaced: 0.780.39 - 0.781.1 from tdtestclu01
 
 So not sure how to remove that constraint on a permanent basis.. it's gone
 as long as I don't stop pacemaker.

Once the DRBD resync is finished it will be removed from the cluster
configuration again automatically... you typically never need to remove
such drbd-fence constraints manually, only in some rare failure scenarios.

Regards,
Andreas


 
 But it used to work both with no-quorum-policy=freeze and that
 constraint
 
 Kind regards
 /Fredrik
 
 
 
 On Thu, Mar 14, 2013 at 2:49 PM, Andreas Kurz andr...@hastexo.com wrote:
 
 On 2013-03-14 13:30, Fredrik Hudner wrote:
 Hi all,

 I have a problem after I removed a node with the force command from my
 crm
 config.

 Originally I had 2 nodes running HA cluster (corosync 1.4.1-7.el6,
 pacemaker 1.1.7-6.el6)



 Then I wanted to add a third node acting as quorum node, but was not able
 to get it to work (probably because I don’t understand how to set it up).

 So I removed the 3rd node, but had to use the force command as crm
 complained when I tried to remove it.



 Now when I start up Pacemaker the resources don't look like they come up
 correctly



 Online: [ testclu01 testclu02 ]



 Master/Slave Set: ms_drbd_nfs [p_drbd_nfs]

  Masters: [ testclu01 ]

  Slaves: [ testclu02 ]

 Clone Set: cl_lsb_nfsserver [p_lsb_nfsserver]

  Started: [ tdtestclu01 tdtestclu02 ]

 Resource Group: g_nfs

  p_lvm_nfs  (ocf::heartbeat:LVM):   Started testclu01

  p_fs_shared(ocf::heartbeat:Filesystem):Started testclu01

  p_fs_shared2   (ocf::heartbeat:Filesystem):Started testclu01

  p_ip_nfs   (ocf::heartbeat:IPaddr2):   Started testclu01

 Clone Set: cl_exportfs_root [p_exportfs_root]

  Started: [ testclu01 testclu02 ]



 Failed actions:

 p_exportfs_root:0_monitor_3 (node=testclu01, call=12, rc=7,
 status=complete): not running

 p_exportfs_root:1_monitor_3 (node=testclu02, call=12, rc=7,
 status=complete): not running



 The filesystems mount correctly on the master at this stage and can be
 written to.

 When I stop the services on the master node for it to fail over, it doesn't
 work.. It loses cluster IP connectivity

 fix your no-quorum-policy, you want to ignore the quorum in a
 two-node cluster to allow failover ... and if your drbd device is
 already in sync, remove that drbd-fence-by-handler-nfs-ms_drbd_nfs
 constraint.

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now




 Corosync.log from master after I stopped pacemaker on master node :  see
 attached file



 Additional files (attached): crm-configure show

   Corosync.conf


 Global_common.conf





 I’m not sure how to proceed to get it up in a fair state now

 So if anyone could help me it would be much appreciated



 Kind regards

 /Fredrik







Re: [Linux-HA] action monitor not advertised in meta-data

2013-02-28 Thread Andreas Kurz
On 2013-02-06 15:32, Mario Linick wrote:
 Hi everyone,
 
 I have a problem with a warning message when configuring a DRBD RA.
 
 clusterinfo:
 nodes: 2
 os: sles11sp2 + sleha
 drbd-version: 8.41
 pacemaker: 1.1.7
 corosync: 1.4.3
 
 
 I'm stuck trying to add the DRBD resources. Specifically, whenever I try to 
 configure my DRBD resources I get the following:
 The input is:
 
 crm configure primitive drbd_r0 ocf:linbit:drbd params drbd_resource=r0 op monitor interval=15
 
 the output is:
 
 WARNING: drbd_r0: action monitor not advertised in meta-data, it may not be 
 supported by the RA
 
 my question: 
 
 What did I do wrong (I don't understand the implications of this warning)?

The DRBD RA only advertises monitor actions for the Master and Slave
roles. Try:

crm configure primitive drbd_r0 ocf:linbit:drbd \
params drbd_resource=r0 \
op monitor interval=30s role=Slave \
op monitor interval=29s role=Master

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 can anyone help?
 
 Thanks in advance,
 
 Mario
 
 
 
 
 
 







Re: [Linux-HA] Node UNCLEAN (online)

2012-12-13 Thread Andreas Kurz
On 12/13/2012 10:00 AM, Josh Bowling wrote:
 I have a two node cluster running Ubuntu 12.04 and had everything working
 just fine (STONITH, failover, etc.) until I changed up the virtual machines
 and the crm configuration to match (my virtual machines get booted by
 pacemaker on the survivor node when a failover occurs).  The primary node
 currently has a status of UNCLEAN (online) as it tried to boot a VM that
 no longer existed - had changed the VMs but not the crm configuration at
 this point.  I have since modified the configuration and synced data with
 DRBD so everything is good to go except for pacemaker.

So the second node was offline? ... as it did not fence the primary node?

 
 Is there a way to remove the error and set the UNCLEAN node to just
 online?  I think since it's currently seen as unclean, the new
 configuration won't propagate to the secondary node.  In order to
 ensure high availability, both machines need to be clean, online, and have
 the same crm configuration.

Assuming the second node does not run pacemaker ... switch the cluster
into maintenance-mode, adjust your crm configuration to reflect your
changes to the VMs, and restart pacemaker.

Once you have made sure all is in place and all VM configurations have
been successfully probed, disable maintenance-mode and start pacemaker
on the second node.
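A sketch of that sequence in crm shell and service commands (the service
invocations assume a sysv-style init setup, adjust to your distribution):

  crm configure property maintenance-mode=true
  # adjust the VM resource definitions, then restart pacemaker on this node
  service pacemaker restart
  # after verifying the probes:
  crm configure property maintenance-mode=false
  # and on the second node:
  service pacemaker start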

 
 I'm hoping there's a quick way to get this cluster back on track.  I doubt
 one of my servers are going to fail any time soon, but you never know.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Thanks in advance,
 
 Josh
 







Re: [Linux-HA] how to defined a service start on a node

2012-12-05 Thread Andreas Kurz
On 12/04/2012 08:56 PM, Emmanuel Saint-Joanis wrote:
 This setup might do the trick :
 
 primitive srv-mysql lsb:mysql \
 op monitor interval=120 \
 op start interval=0 timeout=60 on-fail=restart \
 op stop interval=0 timeout=60s on-fail=ignore
 
 primitive srv-websphere lsb:websphere \
 op monitor interval=120 \
 op start interval=0 timeout=60 on-fail=restart \
 op stop interval=0 timeout=60s on-fail=ignore
 
 ms ms-drbd-data drbd-data \
 meta master-max=1 master-node-max=1 clone-max=2 notify=true
 target-role=Master
 
 colocation mysql-only-slave -inf: srv-mysql ms-drbd-data:Master

with a score of -inf this would prevent srv-mysql from ever running on
the same node as the DRBD master ... even in case of a node failure ...
a negative but non-infinite score would still allow them to run together
when a node fails.
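For example, a sketch with an illustrative score of -1000 instead of -inf
(the constraint id is arbitrary and the exact score is a tuning choice):

  colocation mysql-prefer-other-node -1000: srv-mysql ms-drbd-data:Master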

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 colocation websphere-only-master inf: srv-websphere ms-drbd-data:Master
 
 The key point is to ensure that each service runs on a different node.
 After that you must configure the orders.
 
 
 2012/12/4 alonerhu alone...@gmail.com
 
 I have two machines with corosync+pacemaker, and I want to run mysql +
 websphere. How can I define that mysql starts on node1 and websphere
 starts on node2?

 thanks.

 I use drbd for data sync.



 alonerhu via foxmail

 








Re: [Linux-HA] master/slave drbd resource STILL will not failover

2012-12-05 Thread Andreas Kurz
On 12/05/2012 09:31 PM, Robinson, Eric wrote:
 you could probably find the stop action in the 
 RA and replace it with (e.g.) logger 'AIE ***I did not 
 want this***' and then see what gets logged.

 --
 
 Well, that worked, in the sense that the resource now fails over. I replaced 
 the start and stop actions in the RA with logger commands. Now when I do 'crm 
 node standby' on the primary, I get the following in the messages log (since 
 there are two drbd resources):
 
 Dec  5 12:23:49 ha09b root: STOP action disabled
 Dec  5 12:23:51 ha09b root: STOP action disabled
 Dec  5 12:24:22 ha09b root: START action disabled
 Dec  5 12:24:25 ha09b root: START action disabled
 
 The resource then fails over, though it takes maybe 30 seconds to complete. I 
 confirmed that /proc/drbd now shows the correct status on both nodes. I was 
 able to repeat this back and forth a few times just to be sure. 
 
 When one node is offline, crm_mon shows that the resources are stopped (which
 they actually are NOT).
 
 Sigh. The RA is clearly not working right, but I don't know if that is the 
 root cause of the failover problems or just a symptom of it.
 
 Now what?

I'd go with DRBD 8.3.14; there are precompiled RPMs available, e.g. from
elrepo (testing) ... and I found this thread about DRBD & Pacemaker
1.1.8, where crm-fence-peer.sh is not working correctly without some
modifications:

http://www.gossamer-threads.com/lists/drbd/users/24550#24550

hth & Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 --Eric
 
 
 
 







Re: [Linux-HA] How to configure resources on pacemaker?

2012-12-05 Thread Andreas Kurz
On 12/05/2012 10:49 PM, Felipe Gutierrez wrote:
 Hi,
 
 I configured my pacemaker wrong. I have resources that are wrong, so I need
 to delete them and configure them again.
 Does anyone know how to remove resources and how to configure them
 correctly?

to clean your complete configuration and start over with an empty one,
you can use: cibadmin --force --erase
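And a sketch for removing just a single resource (the delete only works
once the resource is really stopped; in the output below the nodes are
UNCLEAN, so the stop never completes):

  crm resource stop vm8-DRBD
  # wait until crm_mon shows it as Stopped, then:
  crm configure delete vm8-DRBD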

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 I tried to remove them, but the system claims they are running even if I stop them
 # crm resource stop vm8-DRBD
 # crm configure
 crm(live)configure# delete vm8-DRBD
 ERROR: resource vm8-DRBD is running, can't delete it
 
 # crm_mon --one-shot -V
 crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_resources: Resource
 start-up disabled since no STONITH resources have been defined
 crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_resources: Either
 configure some or disable STONITH with the stonith-enabled option
 crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_resources: NOTE: Clusters
 with shared data need STONITH to ensure data integrity
 crm_mon[28783]: 2012/12/05_19:42:56 ERROR: unpack_rsc_op: Hard error -
 Cluster-FS-Mount_last_failure_0 failed with rc=6: Preventing
 Cluster-FS-Mount from re-starting anywhere in the cluster
 
 Last updated: Wed Dec  5 19:42:56 2012
 Last change: Wed Dec  5 18:44:16 2012 via crmd on cloud8
 Stack: Heartbeat
 Current DC: cloud8 (949237ab-9f7d-47d1-b4ad-39e4583d8f0d) - partition with
 quorum
 Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
 2 Nodes configured, unknown expected votes
 3 Resources configured.
 
 
 Node cloud8 (949237ab-9f7d-47d1-b4ad-39e4583d8f0d): UNCLEAN (online)
 Node cloud10 (6f2a6b44-00c1-4ef2-b936-534a21f3dc45): UNCLEAN (online)
 
 
 Failed actions:
 vm8_start_0 (node=cloud8, call=14, rc=5, status=complete): not installed
 vm8-DRBD:1_stop_0 (node=cloud8, call=21, rc=5, status=complete): not
 installed
 Cluster-FS-Mount_stop_0 (node=cloud8, call=11, rc=6, status=complete):
 not configured
 vm8_start_0 (node=cloud10, call=8, rc=1, status=complete): unknown error
 vm8-DRBD:0_stop_0 (node=cloud10, call=18, rc=5, status=complete): not
 installed
 
 Thanks in advance,
 Felipe
 







Re: [Linux-HA] Pacemaker symmetric-cluster false problem.

2012-11-12 Thread Andreas Kurz
On 11/12/2012 05:49 PM, Rafał Radecki wrote:
 Hi all.
 
 I have a cluster of 4 nodes:
 - lb1.local, lb2.local: pound, varnish, memcache, nginx;
 - storage1, storage2: no primitives/resources yet.
 I want to set up an opt-in cluster
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch06s02s02.html.
 My config is:
 
 node lb1.local
 node lb2.local
 node storage1
 node storage2
 primitive LBMemcached lsb:memcached \
 meta target-role=Started
 primitive LBPound lsb:pound
 primitive LBVIP ocf:heartbeat:IPaddr2 \
 params ip=10.0.2.200 cidr_netmask=32 \
 op monitor interval=30s \
 meta priority=100 target-role=Started is-managed=true
 primitive LBVarnish lsb:varnish \
 meta target-role=Started
 primitive MCVIP ocf:heartbeat:IPaddr2 \
 params ip=192.168.100.200 cidr_netmask=32 \
 op monitor interval=30s \
 meta target-role=Started
 location LBMemcached_not_storage1 LBMemcached -inf: storage1
 location LBMemcached_not_storage2 LBMemcached -inf: storage2
 location LBMemcached_prefer_lb1 LBMemcached 100: lb1.local
 location LBMemcached_prefer_lb2 LBMemcached 200: lb2.local
 location LBPound_not_storage1 LBPound -inf: storage1
 location LBPound_not_storage2 LBPound -inf: storage2
 location LBPound_prefer_lb1 LBPound 200: lb1.local
 location LBPound_prefer_lb2 LBPound 100: lb2.local
 location LBVIP_not_storage1 LBVIP -inf: storage1
 location LBVIP_not_storage2 LBVIP -inf: storage2

you are missing an explicit score for LBVIP on lb1/lb2 

 location LBVarnish_not_storage1 LBVarnish -inf: storage1
 location LBVarnish_not_storage2 LBVarnish -inf: storage2
 location LBVarnish_prefer_lb1 LBVarnish 200: lb1.local
 location LBVarnish_prefer_lb2 LBVarnish 200: lb2.local
 location MCVIP_not_storage1 MCVIP -inf: storage1
 location MCVIP_not_storage2 MCVIP -inf: storage2

and another missing explicit score for MCVIP on lb1/lb2
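A sketch of the missing opt-in constraints, with scores that simply mirror
the existing LBPound/LBMemcached entries (pick your own preferences):

location LBVIP_prefer_lb1 LBVIP 200: lb1.local
location LBVIP_prefer_lb2 LBVIP 100: lb2.local
location MCVIP_prefer_lb1 MCVIP 100: lb1.local
location MCVIP_prefer_lb2 MCVIP 200: lb2.local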

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 colocation LBMemcached_with_MCVIP inf: LBMemcached MCVIP
 colocation LBPound_with_LBVIP inf: LBPound LBVIP
 colocation LBVarnish_with_LBVIP inf: LBVarnish LBVIP
 order LBMemcached_after_MCVIP inf: MCVIP LBMemcached
 order LBPound_after_LBVIP inf: LBVIP LBPound
 order LBVarnish_after_LBVIP inf: LBVIP LBVarnish
 property $id=cib-bootstrap-options \
 dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
 cluster-infrastructure=openais \
 expected-quorum-votes=4 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 symmetric-cluster=true \
 default-resource-stickiness=1 \
 last-lrm-refresh=1352476209
 rsc_defaults $id=rsc-options \
 resource-stickiness=10 \
 migration-threshold=100
 
 The problem is that when I set
 crm_attribute --attr-name symmetric-cluster --attr-value true
 crm_mon:
 
 Last updated: Mon Nov 12 17:48:14 2012
 Last change: Fri Nov  9 17:39:24 2012 via crm_attribute on storage1
 Stack: openais
 Current DC: storage2 - partition with quorum
 Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
 4 Nodes configured, 4 expected votes
 5 Resources configured.
 
 
 Online: [ lb1.local lb2.local storage1 storage2 ]
 
 LBVIP   (ocf::heartbeat:IPaddr2):   Started lb1.local
 LBPound (lsb:pound):Started lb1.local
 LBVarnish   (lsb:varnish):  Started lb1.local
 MCVIP   (ocf::heartbeat:IPaddr2):   Started lb2.local
 LBMemcached (lsb:memcached):Started lb2.loca
 
 but when I set
 crm_attribute --attr-name symmetric-cluster --attr-value false
 crm_mon:
 
 Last updated: Mon Nov 12 17:48:43 2012
 Last change: Fri Nov  9 17:39:55 2012 via crm_attribute on storage1
 Stack: openais
 Current DC: storage2 - partition with quorum
 Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
 4 Nodes configured, 4 expected votes
 5 Resources configured.
 
 
 Online: [ lb1.local lb2.local storage1 storage2 ]
 
 As far as I can see I have proper priorities set for my
 primitives/resources. Any clues?
 
 Best regards,
 Rafal Radecki.
 








Re: [Linux-HA] STONITH for Amazon EC2 - fence_ec2

2012-10-22 Thread Andreas Kurz
On 10/16/2012 05:16 PM, Borja García de Vinuesa wrote:
 Hi again!
 
 As I said, now pacemaker takes fence_ec2. However, it's not working as 
 expected. This is what I find when starting pacemaker on all nodes (online 
 ha1 and ha2 and ha3 in standby):
 
 --
 
 Last updated: Tue Oct 16 17:11:47 2012
 Last change: Tue Oct 16 16:57:46 2012 via cibadmin on ha1
 Stack: openais
 Current DC: ha2 - partition with quorum
 Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
 3 Nodes configured, 3 expected votes
 6 Resources configured.
 
 
 Node ha3: standby
 Online: [ ha1 ha2 ]
 
 Master/Slave Set: ms_drbd_r0 [p_drbd_r0]
  Masters: [ ha1 ]
  Slaves: [ ha2 ]
 Resource Group: g_r0
  p_fs_r0(ocf::heartbeat:Filesystem):Started ha1
  p_ip_r0(ocf::heartbeat:IPaddr2):   Started ha1
  p_r0   (ocf::heartbeat:mysql): Started ha1
 
 Failed actions:
 stonith_my-ec2-nodes_start_0 (node=ha2, call=11, rc=1, status=complete): 
 unknown error
 stonith_my-ec2-nodes_start_0 (node=ha1, call=7, rc=1, status=complete): 
 unknown error
 
 -
 
 
 CONFIGURATION
 
 [root@ha1 ~]# crm configure show
 node ha1
 node ha2
 node ha3 \
 attributes standby=on
 primitive p_drbd_r0 ocf:linbit:drbd \
 params drbd_resource=r0 \
 op monitor interval=15s
 primitive p_fs_r0 ocf:heartbeat:Filesystem \
 params device=/dev/drbd0 directory=/var/lib/mysql_drbd 
 fstype=ext4
 primitive p_ip_r0 ocf:heartbeat:IPaddr2 \
 params ip=54.X.X.X  cidr_netmask=32 nic=eth0
 primitive p_r0 ocf:heartbeat:mysql \
 params binary=/usr/bin/mysqld_safe config=/etc/my.cnf 
 datadir=/var/lib/mysql_drbd/mysql_data pid=/var/run/mysqld/mysqld.pid 
 socket=/var/lib/mysql/mysql.sock user=root group=root 
 additional_parameters="--bind-address=54.X.X.X --user=root" \
 op start interval=0 timeout=120s \
 op stop interval=0 timeout=120s \
 op monitor interval=20s timeout=30s \
 meta is-managed=true
 primitive stonith_my-ec2-nodes stonith:fence_ec2 \
 params ec2-home=$EC2_HOME pcmk_host_check="static-list" \
 pcmk_host_list="ha1 ha2 ha3" \

I'd say you are missing that $EC2_HOME in the env ... but the logs should
tell you. I used the full path here and it works for me.
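A sketch of the same primitive with the path written out literally (the
path is the one mentioned further down in this thread):

primitive stonith_my-ec2-nodes stonith:fence_ec2 \
        params ec2-home="/tmp/ec2-api-tools-1.6.4" pcmk_host_check="static-list" \
               pcmk_host_list="ha1 ha2 ha3" \
        op monitor interval=600s timeout=300s \
        op start start-delay=30s interval=0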

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 op monitor interval=600s timeout=300s \
 op start start-delay=30s interval=0
 group g_r0 p_fs_r0 p_ip_r0 p_r0
 ms ms_drbd_r0 p_drbd_r0 \
 meta master-max=1 master-node-max=1 clone-max=2 
 clone-node-max=1 notify=true
 location cli-prefer-g_r0 g_r0 \
 rule $id=cli-prefer-rule-g_r0 inf: #uname eq ha1
 location cli-prefer-p_r0 p_r0 \
 rule $id=cli-prefer-rule-p_r0 inf: #uname eq ha1
 colocation c_r0_on_drbd inf: g_r0 ms_drbd_r0:Master
 order o_drbd_before_mysql inf: ms_drbd_r0:promote g_r0:start
 property $id=cib-bootstrap-options \
 dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
 cluster-infrastructure=openais \
 expected-quorum-votes=3 \
 no-quorum-policy=ignore \
 stonith-enabled=false \
 stonith-action=reboot \
 last-lrm-refresh=1349707463
 rsc_defaults $id=rsc-options \
 resource-stickiness=100
 
 With EC2_HOME=/tmp/ec2-api-tools-1.6.4
 
 Any ideas on which could be the problem here?
 
 Thanks in advance!
 
 Borja Gª de Vinuesa O.
 Communications Administrator
 Skype: borja.garcia.actualize
 
 
 
 
 Avda. Valdelaparra, 27
 P. E. Neisa Norte Edif. 1, 3ªpl.
 28108 Alcobendas Madrid- España
 Tel. +34 91 799 40 70
 Fax +34 91 799 40 79
 www.actualize.eshttp://www.actualize.es/
 
 
 
 
 ESPAÑA * BRASIL * CHILE * COLOMBIA * MEXICO * UK
 
 

Re: [Linux-HA] Stickiness confusion

2012-10-10 Thread Andreas Kurz
On 10/11/2012 12:43 AM, Kevin F. La Barre wrote:
 I'm testing stickiness in a sandbox that consists of 3 nodes.  The
 configuration is very simple but it's not acting the way I think it should.
 
 
 My configuration:
 
 # crm configure show
 node hasb1
 node hasb2
 node hasb3
 primitive postfix lsb:postfix \
 op monitor interval=15s
 property $id=cib-bootstrap-options \
 dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \
 cluster-infrastructure=openais \
 expected-quorum-votes=3 \
 no-quorum-policy=ignore \
 stonith-enabled=false \
 last-lrm-refresh=1349902760 \
 maintenance-mode=false \
 is-managed-default=true
 rsc_defaults $id=rsc-options \
 resource-stickiness=100
 
 
 
 The test resource postfix lives on hasb1.
 
 # crm_simulate -sL
 
 Current cluster status:
 Online: [ hasb1 hasb3 hasb2 ]
 
  postfix(lsb:postfix):  Started hasb1
 
 Allocation scores:
 native_color: postfix allocation score on hasb1: 100
 native_color: postfix allocation score on hasb2: 0
 native_color: postfix allocation score on hasb3: 0
 
 
 On hasb1 I'll kill the corosync process.  Resource moves over to hasb2 as
 expected.

So cluster processes are killed and the resource keeps running on hasb1
... and starts a second time on hasb2 as hasb1 is still running and you
have no stonith ...

 
 # crm status
 
 Last updated: Wed Oct 10 22:35:23 2012
 Last change: Wed Oct 10 21:30:12 2012 via crm_resource on hasb2
 Stack: openais
 Current DC: hasb2 - partition with quorum
 Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
 3 Nodes configured, 3 expected votes
 1 Resources configured.
 
 
 Online: [ hasb3 hasb2 ]
 OFFLINE: [ hasb1 ]
 
  postfix(lsb:postfix):  Started hasb2
 
 
 # crm_simulate -sL
 
 Current cluster status:
 Online: [ hasb3 hasb2 ]
 OFFLINE: [ hasb1 ]
 
  postfix(lsb:postfix):  Started hasb2
 
 Allocation scores:
 native_color: postfix allocation score on hasb1: 0
 native_color: postfix allocation score on hasb2: 100
 native_color: postfix allocation score on hasb3: 0
 
 
 Now I'll start corosync  pacemaker.  Postfix resource moves back to hasb1
 even though we have default stickiness.

In your logs you will see the cluster detect postfix running twice and
do a "stop all / start one" recovery by default ... if you want to test
node failures, really stop/reset a server, e.g.:

echo b > /proc/sysrq-trigger

... and use stonith!
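
A minimal sketch of what that could look like -- external/ipmi is only an
example here and all parameters are placeholders for your management boards:

crm configure primitive st-hasb1 stonith:external/ipmi \
        params hostname="hasb1" ipaddr="192.168.1.101" userid="admin" \
        passwd="secret" interface="lan"
crm configure location l-st-hasb1 st-hasb1 -inf: hasb1
crm configure property stonith-enabled="true"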

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 # crm status
 
 Last updated: Wed Oct 10 22:37:00 2012
 Last change: Wed Oct 10 21:30:12 2012 via crm_resource on hasb2
 Stack: openais
 Current DC: hasb2 - partition with quorum
 Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
 3 Nodes configured, 3 expected votes
 1 Resources configured.
 
 
 Online: [ hasb1 hasb3 hasb2 ]
 
  postfix(lsb:postfix):  Started hasb1
 
 
 # crm_simulate -sL
 
 Current cluster status:
 Online: [ hasb1 hasb3 hasb2 ]
 
  postfix(lsb:postfix):  Started hasb1
 
 Allocation scores:
 native_color: postfix allocation score on hasb1: 100
 native_color: postfix allocation score on hasb2: 0
 native_color: postfix allocation score on hasb3: 0
 
 
 What am I missing?  I'm pulling my hair - any help would be appreciated
 greatly.
 
 Corosync 1.4.1
 Pacemaker 1.1.7
 CentOS 6.2
 
 
 -Kevin
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] STONITH for Amazon EC2 - fence_ec2

2012-10-08 Thread Andreas Kurz
On 10/06/2012 08:45 AM, Kevin F. La Barre wrote:
 I'm trying to get the fence_ec2 agent (link below) working and a bit
 confused on how it should be configured.  I have modified the agent with
 the EC2 key and cert, region, etc.  The part of confused about is the
 port argument and how it's supposed to work.  Am I supposed to hardcode
 the uname into the port variable or is this somehow passed into the
 script as an argument?   If I hardcode it, I don't understand how Pacemaker
 passes on the information as to which node to kill.  Versions and config.
 details follow.
 
 I apologize if this has been vague.  Please let me know if you need more
 information.
 
 Fencing agent:
 https://github.com/beekhof/fence_ec2/blob/392a146b232fbf2bf2f75605b1e92baef4be4a01/fence_ec2
 
 crm configure primitive ec2-fencing stonith::fence_ec2 \
 params action=reboot \
 op monitor interval=60s

try something like:

primitive stonith_my-ec2-nodes stonith:fence_ec2 \
        params ec2-home="/root/.ec2" pcmk_host_check="static-list" \
        pcmk_host_list="myec2-01 myec2-02" \
        op monitor interval="600s" timeout="300s" \
        op start start-delay="30s" interval="0"


... where the node names are sent as the port parameter.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Corosync v1.4.1
 Pacemaker v1.1.7
 CentOS 6.2
 
 -Kevin
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 






___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Heartbeat not starting when both nodes are down

2012-10-08 Thread Andreas Kurz
On 10/08/2012 09:42 PM, Nicolás wrote:
 On 28/09/2012 20:42, Nicolás wrote:
 Hi all!

 I'm new to this list, I've been looking to get some info about this but
 I haven't seen anything, so I'm trying this way.

 I've successfully configured a 2-node cluster with DRBD + Heartbeat +
 Pacemaker. It works as expected.

 The problem comes when both nodes are down. Having this, after powering
 on one of the nodes, I can see it configuring the network but after this
 I never see the console for this machine. So I try to connect via SSH
 and realize that Heartbeat is not running. After I run it manually I can
 see the console for this node. This only happens when BOTH nodes are
 down. When just one is, everything goes right as Heartbeat starts
 automatically on the powering-on node.

 I see nothing relevant in logs, my conf is as follows:

 root@cluster1:~# cat /etc/ha.d/ha.cf | grep -e ^[^#]
 logfacility local0
 ucast eth1 192.168.0.91
 ucast eth0 192.168.20.51
 auto_failback on
 nodecluster1.gamez.es cluster2.gamez.es
 use_logd yes
 crm  on
 autojoin none

 Any ideas on what am I doing wrong?

Looks like the DRBD init script is enabled with its default startup-timeout
parameters ... that script blocks until the peer is connected or a timeout
expires -- by default forever (depending on some configuration parameters)
or until manual confirmation on the console ... and as heartbeat is
typically last in the boot process, it is not (yet) started.

For a new cluster use Corosync instead of Heartbeat, disable the DRBD init
script and configure DRBD as a Pacemaker master-slave resource.
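
Roughly along these lines -- only a sketch, the resource and device names
are assumptions and the full master-slave setup is described in the DRBD
users guide:

chkconfig drbd off        # or: update-rc.d -f drbd remove on Debian
crm configure primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="29s" role="Master" \
        op monitor interval="31s" role="Slave"
crm configure ms ms_drbd_r0 p_drbd_r0 \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true"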

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 Thanks a lot in advance.

 Nicolás
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 Any ideas with this?
 
 Thanks!
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] STONITH for Amazon EC2 - fence_ec2

2012-10-08 Thread Andreas Kurz
On 10/08/2012 10:44 PM, Kevin F. La Barre wrote:
 Thank you Andreas, your input does further my understanding of how this is
 supposed to work; however, I'm still unclear about your statement where
 the nodenames are sent as port parameter.  Specifically, how is the port
 parameter passed from Pacemaker to the fencing script?  Should port be
 specified in the primitive declaration, and if so, since it's dynamic we
 cannot simply set it to a static node-name, right?  Also, if you look at
 the code for fence_ec2 provided in the link within my original email the
 port variable is essentially empty, port=.  If we leave this empty the
 script will never receive the node name, but we cannot statically populate
 the variable.  I would think that node A or node B would be the ones
 fencing Node C.  Node C after-all may not be able to communicate.

Pacemaker also passes the name of the cluster node to be fenced to the
stonith agent, and that is interpreted as the port. I am sure you read the
description of how to use this agent and how it tries to find the correct
EC2 instance name to fence based on that node name.
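
If you want to watch that happen, a rough sketch (assuming your pacemaker
build ships stonith_admin; the node name is only an example, and this
really reboots the instance if it works):

stonith_admin --reboot ha3
grep -iE 'stonith|fence_ec2' /var/log/messages | tail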

Regards,
Andreas


 
 The above comes from the fact that the node/port is not being passed in my
 configuration for some reason.  I'm seeing errors to the effect that
 INSTANCE (the ec2 instance, aka port) is not being specified.
 
 Any help is much appreciated!
 
 
 Respectfully,
 Kevin
 
 
 On Mon, Oct 8, 2012 at 2:10 PM, Andreas Kurz andr...@hastexo.com wrote:
 
 On 10/06/2012 08:45 AM, Kevin F. La Barre wrote:
 I'm trying to get the fence_ec2 agent (link below) working and a bit
 confused on how it should be configured.  I have modified the agent with
 the EC2 key and cert, region, etc.  The part I'm confused about is the
 port argument and how it's supposed to work.  Am I supposed to hardcode
 the uname into the port variable or is this somehow passed into the
 script as an argument?   If I hardcode it, I don't understand how
 Pacemaker
 passes on the information as to which node to kill.  Versions and config.
 details follow.

 I apologize if this has been vague.  Please let me know if you need more
 information.

 Fencing agent:

 https://github.com/beekhof/fence_ec2/blob/392a146b232fbf2bf2f75605b1e92baef4be4a01/fence_ec2

 crm configure primitive ec2-fencing stonith::fence_ec2 \
 params action=reboot \
 op monitor interval=60s

 try something like:

 primitive stonith_my-ec2-nodes stonith:fence_ec2 \
 params ec2-home=/root/.ec2 pcmk_host_check=static-list
 pcmk_host_list=myec2-01 myec2-02 \
 op monitor interval=600s timeout=300s \
 op start start-delay=30s interval=0


 ... where the node names are sent as the port parameter.

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 Corosync v1.4.1
 Pacemaker v1.1.7
 CentOS 6.2

 -Kevin
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems






 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 


-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Active/Active Cluster for DRBD partition

2012-07-25 Thread Andreas Kurz
On 07/24/2012 10:21 AM, Yount, William D wrote:
 I have two servers: KNTCLFS001 and KNTCLFS002
 I have a drbd partition named nfs,  on each server. They are mirrored. The 
 mirroring works perfectly.
 
 What I want is to serve this drbd partition up and have it so that if one 
 server goes down, the drbd partition is still available on the other server. 
 I am trying to do this in an active/active cluster. I have a cloned IP 
 address that is supposed to be running on both machines at the same time.
 
 I can get the resources setup according to the Clusters from Scratch guide. 
 The issue is that as soon as I take KNTCLFS001 down, all my resources go down 
 as well. They won't even stay running on KNTCLFS002.
 

Your configuration does not look like you followed Clusters from
Scratch ... drbd has to be a master-slave resource and not a clone, the
filesystem and IP need to be ordered to start after the drbd promotion,
and they need colocation constraints ... all of that is explained step by
step in that document. And you are missing STONITH, a must-have for
dual-primary drbd.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 Someone suggested adding constraints but that doesn't really make sense. I 
 want the services running on all systems at once. Constraints are 
 preconditions for bringing the service up on another machine, but the service 
 should already be running so in theory, constraints shouldn't be necessary.
 
 Any help will be greatly appreciated. I have attached my cib.xml to give 
 anyone a better idea of what I am trying to do.  I have been going through 
 the Pacemaker 1.1 Configuration Explained but so far I haven't found 
 anything.
 
 
 William
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 






___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] OpenSuSE 12.1 and Heartbeat

2012-07-18 Thread Andreas Kurz
On 07/18/2012 12:35 PM, claudi...@mediaservice.net wrote:
 Hi to all,
 
 Anyone have build Heartbeat on OpenSuSE 12.1 ? In this new version of
 SuSE, Heartbeat is not maintained anymore, but i have to upgrade 2
 systems using Heartbeat...

Seems to be a good chance to switch to Corosync. If you (hopefully)
already run Pacemaker, the switch is quite easy.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Cordially,
 
 Claudio Prono.
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 






___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Pacemaker/Corosync issue on Amazon VPC (ec2)

2012-06-30 Thread Andreas Kurz
On 06/29/2012 05:22 PM, Heitor Lessa wrote:
 
 Hi,
  I have installed DRBD+OCFS2 and working Amazon EC2, however as a previous 
 thread suggested we should use Pacemaker in order to get OCFS modified in 
 runtime (modify/del nodes).
  Pacemaker/corosync and other components were very straight forward 
 installing via Lucid-Cluster and Ubuntu-HA, but at the first steps I 
 experienced some problems with CoroSync regarding network connectivity.
  Unfortunately, Amazon does not allow Multicast, so I used udpu once it would 
 be the only way to get it working, but when I started I got same error on 
 logs Even with all traffic allowed, no apparmor (ubuntu), no iptables locally 
 at all:
 
 Jun 29 15:11:11 corosync [TOTEM ] Totem is unable to form a cluster because 
 of an operating system or network fault. The most common cause of this 
 message is that the local firewall is configured improperly.
  Just for sake, I used iperf and netcat to send UDP packets and it is working 
 fine in several ports, so we can rule out firewall issue.
  Any thoughts?

yes .. security groups, adjustable in your EC2 management console.
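
For reference, a minimal sketch of the corosync.conf pieces for unicast
inside a VPC -- the addresses are placeholders, and the UDP ports derived
from mcastport must be open in the security group between the nodes:

totem {
        version: 2
        transport: udpu
        interface {
                ringnumber: 0
                bindnetaddr: 10.0.0.0
                mcastport: 5405
                member {
                        memberaddr: 10.0.0.11
                }
                member {
                        memberaddr: 10.0.0.12
                }
        }
}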

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

  Thank you very much.
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 








___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] WARNING: Resources violate uniqueness

2012-06-29 Thread Andreas Kurz
On 06/29/2012 11:48 AM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote:
 Hi,
 
 I am in the process to setup a cluster using HP iLo3 as stonith devices using 
 SLES11 SP2 HA Extension. (I used formerly heartbeat-1 clusters successfully).
 
 I defined the following primitives for stonith fencing.
 
 primitive st-ilo-rt-lxcl9ar stonith:ipmilan \
 params hostname=rt-lxcl9ar.de.bosch.com ipaddr=10.13.172.85 
 port=623 auth=straight priv=admin login=stonith password=secret
 
 primitive st-ilo-rt-lxcl9br stonith:ipmilan \
 params hostname=rt-lxcl9br.de.bosch.com ipaddr=10.13.172.93 
 port=623 auth=straight priv=admin login=stonith password=secret
 
 Using the location keyword I made sure that the stonith is not running on its 
 own node:
 
 location l-st-rt-lxcl9a st-ilo-rt-lxcl9ar -inf: rt-lxcl9a
 location l-st-rt-lxcl9b st-ilo-rt-lxcl9br -inf: rt-lxcl9b
 
 crm(live)configure# verify leads to some warnings:
 
 WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for 
 parameter port: 623
 WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for 
 parameter password: secret
 WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for 
 parameter auth: straight
 WARNING: Resources st-ilo-rt-lxcl9ar,st-ilo-rt-lxcl9br violate uniqueness for 
 parameter priv: admin
 
 I now have two questions.
 
 1. Why does crm require the two distinct primitives st-ilo-rt-lxcl9ar and 
 st-ilo-rt-lxcl9br to have different parameters? Is this an error in 
 stonith:ipmilan?
 

yes, this is an error in the metadata of ipmilan ... but you should
still be able to commit that configuration.

 2. Is stonith:ipmilan the correct stonith driver for HP iLo3 (DL380 G7)? If 
 yes, how to test if it is really working?

well, I personally prefer external/ipmi ... used it on several
management cards including ilos without a problem.

You can use the stonith command to test it prior to cluster
integration, or -- once configured in Pacemaker -- you can also kill the
cluster stack on a node (killall -9 corosync) and hopefully see it being
fenced.
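
For example, something along these lines on the command line -- the values
and node name are placeholders, and the reset really power-cycles the node:

# list the parameters the plugin expects
stonith -t external/ipmi -n
# then a test reset
stonith -t external/ipmi -T reset hostname=node1 ipaddr=10.0.0.1 \
        userid=stonith passwd=secret interface=lanplus node1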

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Best regards
 
 Martin Konold
 
 Robert Bosch GmbH
 Automotive Electronics
 Postfach 13 42
 72703 Reutlingen
 GERMANY
 www.bosch.com
 
 Tel. +49 7121 35 3322
 
 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
 Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz 
 Fehrenbach, Siegfried Dais;
 Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph 
 Kübel, Uwe Raschke,
 Wolf-Henning Scheider, Werner Struth, Peter Tyroller
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 







___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] WARNING: Resources violate uniqueness

2012-06-29 Thread Andreas Kurz
On 06/29/2012 12:33 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote:
 Hi,
 
 thanks for the clarification!
 
 1. Why does crm require the two distinct primitives st-ilo-rt-lxcl9ar and 
 st-ilo-rt-lxcl9br to have different parameters? Is this an error in 
 stonith:ipmilan?
 
 yes, this is an error in the metadata of ipmilan ... but you should still be 
 able to commit that configuration.
 
 Yes, the configuration can be committed without problems.
 
 2. Is stonith:ipmilan the correct stonith driver for HP iLo3 (DL380 G7)? If 
 yes, how to test if it is really working?
 
 well, I personally prefer external/ipmi ... used it on several management 
 cards including ilos without a problem.
 
 On the commandline I used successfully:
 
 ipmitool -H rt-lxcl9br.de.bosch.com -I lanplus -A PASSWORD -U stonith -P 
 somepassword power status.
 
 -I lanplus is mandatory to get it working with iLo3 (Firmware 1.28)
 
 Can I make use of this command with external/ipmi, or is lanplus not 
 supported by external/ipmi?

it is supported

 
 When executing
 
 # stonith -t external/ipmi -n
 hostname  ipaddr  userid  passwd  interface

interface is the IPMI interface ... lan or lanplus

 
 I cannot see how to configure lanplus.

crm ra info stonith:external/ipmi

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 Best regards
 
 Martin Konold
 
 Robert Bosch GmbH
 Automotive Electronics
 Postfach 13 42
 72703 Reutlingen
 GERMANY
 www.bosch.com
 
 Tel. +49 7121 35 3322
 
 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
 Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz 
 Fehrenbach, Siegfried Dais;
 Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph 
 Kübel, Uwe Raschke,
 Wolf-Henning Scheider, Werner Struth, Peter Tyroller
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] stonith:external/ipmi was WARNING: Resources violate uniqueness

2012-06-29 Thread Andreas Kurz
On 06/29/2012 03:53 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote:
 Hi Andreas,
 
 thank you very much. Stonith works nicely when doing the 'kill -9 corosync'  
 tests.
 
 When looking at /var/log/messages I can see entries like
 
 Jun 29 15:12:07 rt-lxcl9a stonith-ng: [12589]: WARN: parse_host_line: Could 
 not parse (0 0):

Hmm ... well, if it works ... you can open a support request for your
enterprise server 

 
 I am wondering what causes this warning.
 
 primitive stonith-ilo-rt-lxcl9ar stonith:external/ipmi \
 params hostname=rt-lxcl9a ipaddr=10.13.172.85 userid=stonith 
 passwd=XXX passwd_method=param interface=lanplus
 primitive stonith-ilo-rt-lxcl9br stonith:external/ipmi \
 params hostname=rt-lxcl9b ipaddr=10.13.172.93 userid=stonith 
 passwd=XXX passwd_method=param interface=lanplus
 

looks fine ... it should not be needed but you can try to add (of course
for both devices):

pcmk_host_check=static-list pcmk_host_list=rt-lxcl9a

... to the params list.
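
So the complete primitive would look roughly like this (values taken from
your configuration above, password redacted as before):

primitive stonith-ilo-rt-lxcl9ar stonith:external/ipmi \
        params hostname="rt-lxcl9a" ipaddr="10.13.172.85" userid="stonith" \
        passwd="XXX" passwd_method="param" interface="lanplus" \
        pcmk_host_check="static-list" pcmk_host_list="rt-lxcl9a"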

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 Best regards
 
 Martin Konold
 
 Robert Bosch GmbH
 Automotive Electronics
 Postfach 13 42
 72703 Reutlingen
 GERMANY
 www.bosch.com
 
 Tel. +49 7121 35 3322
 
 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
 Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz 
 Fehrenbach, Siegfried Dais;
 Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph 
 Kübel, Uwe Raschke,
 Wolf-Henning Scheider, Werner Struth, Peter Tyroller
 
 -Original Message-
 From: linux-ha-boun...@lists.linux-ha.org 
 [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas Kurz
 Sent: Freitag, 29. Juni 2012 13:06
 To: linux-ha@lists.linux-ha.org
 Subject: Re: [Linux-HA] WARNING: Resources violate uniqueness
 
 On 06/29/2012 12:33 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote:
 Hi,

 thanks for the clarification!

 1. Why does crm the two distrinct primitives st-ilo-rt-lxcl9ar and 
 st-ilo-rt-lxcl9br to have different parameters? Is this an error in 
 stonith:ipmilan?

 yes, this is an error in the metadata of ipmilan ... but you should still 
 be able to commit that configuration.

 Yes, the configuration can be commited without problem.

 2. Is stonith:ipmilan the correct stonith driver for HP iLo3 (DL380 G7)? 
 If yes, how to test if it is really working?

 well, I personally prefer external/ipmi ... used it on several management 
 cards including ilos without a problem.

 On the commandline I used successfully:

 ipmitool -H rt-lxcl9br.de.bosch.com -I lanplus -A PASSWORD -U stonith -P 
 somepassword power status.

 -l lanplus is mandatory to get it working with iLo3 (Firmware 1.28)

 Can I make uses of this command using external/ipmi or is lanplus not 
 supported by external/ipmi?
 
 it is supported
 

 When executing

 # stonith -t external/ipmi -n
 hostname  ipaddr  userid  passwd  interface
 
 interface is the IPMI interface ... lan or lanplus
 

 I cannot see how to configure lanplus.
 
 crm ra info stonith:external/ipmi
 
 Regards,
 Andreas
 
 --
 Need help with Pacemaker?
 http://www.hastexo.com/now
 
 

 Best regards

 Martin Konold

 Robert Bosch GmbH
 Automotive Electronics
 Postfach 13 42
 72703 Reutlingen
 GERMANY
 www.bosch.com

 Tel. +49 7121 35 3322

 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
 Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz 
 Fehrenbach, Siegfried Dais;
 Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph 
 Kübel, Uwe Raschke,
 Wolf-Henning Scheider, Werner Struth, Peter Tyroller
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 








___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] drbd primary/primary for ocfs2 and undetected split brain

2012-06-29 Thread Andreas Kurz
On 06/29/2012 04:02 PM, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) wrote:
 Hi,
 
 I am experiencing an error situation which gets not detected by the cluster.
 
 I created a 2-node cluster using drbd and want to use ocfs2 on both nodes 
 simultaneously. (stripped off some monitor/meta stuff)

Bad idea ... it is pretty useless without the full configuration,
especially the meta attributes in this case. Please also share your drbd
and corosync configuration. BTW: what is your use case for starting with
a simple dual-primary OCFS2 setup?

 
 primitive dlm ocf:pacemaker:controld
 primitive o2cb ocf:ocfs2:o2cb
 primitive resDRBD ocf:linbit:drbd \
 params drbd_resource=r0 \
 operations $id=resDRBD-operations
 primitive resource-fs ocf:heartbeat:Filesystem \
 params device=/dev/drbd_r0 directory=/SHARED fstype=ocfs2
 ms msDRBD resDRBD
 clone clone-dlm dlm
 clone clone-fs resource-fs
 clone clone-ocb o2cb
 colocation colocation-dlm-drbd inf: clone-dlm msDRBD:Master
 colocation colocation-fs-o2cb inf: clone-fs clone-ocb
 colocation colocation-ocation-dlm inf: clone-ocb clone-dlm
 order order-dlm-o2cb 0: clone-dlm clone-ocb
 order order-drbd-dlm 0: msDRBD:promote clone-dlm:start
 order order-o2cb-fs 0: clone-ocb clone-fs
 
 The cluster starts up happily. (everything green in crm_gui) but
 
 rt-lxcl9a:~ # drbd-overview
   0:r0/0  WFConnection Primary/Unknown UpToDate/DUnknown C r-
 rt-lxcl9b:~ # drbd-overview
   0:r0/0  StandAlone Primary/Unknown UpToDate/DUnknown r-
 
 As you can see this is a split brain situation with both nodes having the 
 ocfs2 fs mounted but not in sync -- data loss will happen.
 
 1.  How to avoid split brain situations (I am confident that the cross 
 link using a 10GB cable was never interrupted)?

the logs should reveal what happened

 2.  How to resolve this?
 Switch cluster in maintenance mode and then follow 
 http://www.drbd.org/users-guide/s-resolve-split-brain.html ?

at least you also need to stop the filesystem if it is running and you
want to demote one Primary ... and then follow that link

 3.  How to make the cluster aware of the split brain situation? (It 
 thinks everything is fine)

Set up the fencing method resource-and-stonith in the drbd configuration,
preferably using the crm-fence-peer.sh handler script ... neither Pacemaker
itself nor the DRBD resource agent will react to such a situation otherwise.
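
A sketch of the corresponding drbd configuration -- the handler paths are
the usual packaging defaults, and depending on your DRBD version the
fencing option may live in a different section, so check the users guide:

resource r0 {
        disk {
                fencing resource-and-stonith;
        }
        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
}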

 4.  Should the DRBD/OCFS2 setup be maintained outside the cluster instead?

better not ;-)

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 Mit freundlichen Grüßen / Best regards
 
 Martin Konold
 
 Robert Bosch GmbH
 Automotive Electronics
 Postfach 13 42
 72703 Reutlingen
 GERMANY
 www.bosch.comhttp://www.bosch.com
 
 Tel. +49 7121 35 3322
 
 Sitz: Stuttgart, Registergericht: Amtsgericht Stuttgart, HRB 14000;
 Aufsichtsratsvorsitzender: Hermann Scholl; Geschäftsführung: Franz 
 Fehrenbach, Siegfried Dais;
 Stefan Asenkerschbaumer, Bernd Bohr, Rudolf Colm, Volkmar Denner, Christoph 
 Kübel, Uwe Raschke,
 Wolf-Henning Scheider, Werner Struth, Peter Tyroller
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 





___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Nodes not seeing each other

2012-06-26 Thread Andreas Kurz
On 06/27/2012 12:14 AM, Marcus Bointon wrote:
 
 On 26 Jun 2012, at 22:18, Andreas Kurz wrote:
 
 use STONITH to prevent resources running on both nodes ... you
 configured redundant cluster communication paths?
 
 The nodes in question are Linode VMs, so not much opportunity for that.
 
 With heartbeat you can use the cl_status command with its various
 options to check Heartbeats view of the cluster  and heartbeats log
 messages from the split-brain event should also give you some hints.
 
 cl_status just confirms that each node thinks the other is dead.
 
 ok, I see two things happening in the logs: At one point proxy2 reported a 
 slow heartbeat (20sec, deadtime was set to 15) but seemed to reconnect.
 
 Later on, both nodes reported each other as dead within the same second:
 
 Jun 25 10:14:16 proxy1 heartbeat: [2678]: WARN: node proxy2.example.com: is 
 dead
 Jun 25 10:14:16 proxy1 heartbeat: [2678]: info: Link proxy2.example.com:eth0 
 dead.
 Jun 25 10:14:16 proxy1 crmd: [3205]: notice: crmd_ha_status_callback: Status 
 update: Node proxy2.example.com now has status [dead]

looks like a network problem, yes

 
 As I understand it, STONITH is intended to prevent a node rejoining in case 
 it causes more trouble. In this case the individual nodes were fine, it 
 appeared to be the network that was at fault. Why wouldn't these nodes 
 automatically reconnect, given that there is no STONITH to prevent them? How 
 should I tell them to reconnect manually?
 

STONITH is there to make sure a node is really dead before its resources
are taken over ... with stonith disabled and quorum ignored, the nodes
simply don't care.

If the network is working as expected again, Heartbeat should reconnect
automatically ... if not, restart Heartbeat if you are confident the
network problem is solved.
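
For example, on the node that lost contact, once you trust the network
again (just a sketch):

/etc/init.d/heartbeat restart
cl_status listnodes
cl_status nodestatus proxy2.example.com    # should report "active" again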

Regards,
Andreas

 I can also see that it failed to send alerts from the email resources at the 
 same time because DNS lookups were failing: all points to a wider network 
 issue.
 
 I wonder if Linode has micro-outages on their network since we've also been 
 seeing some problems with mmm reporting 'network unreachable' on some other 
 instances at the same time.
 
 Marcus
 



-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] What's the meaning of ... Failed application of an update diff

2012-06-20 Thread Andreas Kurz
On 06/20/2012 04:35 PM, alain.mou...@bull.net wrote:
 Hi
 
 hb_report does not work. 
 how to do a report tarball ?

It has been renamed to crm_report
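
Usage is basically the same, e.g. (the start time and target name are only
examples):

crm_report -f "2012-06-20 10:00" /tmp/report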

Regards,
Andreas

 
 Thanks
 Alain
 
 
 
 From: Andrew Beekhof and...@beekhof.net
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Date: 19/06/2012 11:37
 Subject: Re: [Linux-HA] What's the meaning of ... Failed application of an 
 update diff
 Sent by: linux-ha-boun...@lists.linux-ha.org
 
 
 
 On Tue, Jun 19, 2012 at 6:29 PM, Lars Marowsky-Bree l...@suse.com wrote:
 On 2012-06-19T08:38:11, alain.mou...@bull.net wrote:

 So that means that my modifications by crm configure edit , even if 
 they
 are correct (I've re-checked them) ,
 have potentially corrupt the Pacemaker configuration ?

 No. The CIB automatically recovers from this by doing a full sync. The
 messages are harmless and only indicate an inefficiency, not a real
 problem.
 
 They could be indicative of a bug.  I wouldn't mind seeing a report
 tarball (maybe file a bug for it).
 



 Regards,
Lars

 --
 Architect Storage/HA
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix 
 Imendörffer, HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar 
 Wilde

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 



-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)

2012-06-19 Thread Andreas Kurz
On 06/19/2012 09:22 AM, alain.mou...@bull.net wrote:
 Hi 
 
 Sorry but I don't know what iirc means , I suppose iirc in this context 
 stands for plugin.
 If so, how can I check for sure that the plugin is or is not in the 
 pacemaker package ?
 (it is to check the pacemaker package delivered with the new RH 6.3)

Check whether the pacemaker package contains the /etc/init.d/pacemaker
init script that starts the MCP ... if yes, then adapt your
corosync.conf as I wrote in my previous mail.
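
A quick way to check, for example:

rpm -ql pacemaker | grep /etc/init.d/pacemaker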

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Thanks 
 Alain
 
 
 
 From: Andrew Beekhof and...@beekhof.net
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Date: 18/06/2012 23:38
 Subject: Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)
 Sent by: linux-ha-boun...@lists.linux-ha.org
 
 
 
 On Mon, Jun 18, 2012 at 11:04 PM,  alain.mou...@bull.net wrote:
 Hi again,

 could you tell me the package which install the pacemaker plugin v1 
 and
 which
 is the name of the binary or binaries or src ?
 
 its the pacemaker source rpm.  it doesn't ship by default iirc
 

 Thanks a lot
 Alain



 From: Andrew Beekhof and...@beekhof.net
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Date: 16/06/2012 12:25
 Subject: Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)
 Sent by: linux-ha-boun...@lists.linux-ha.org



 On Fri, Jun 15, 2012 at 10:06 PM,  alain.mou...@bull.net wrote:
 Hi Andrew

 you recall me in an old thread here that effectively cman was not
 involved

 in option 4 : corosync + cpg + quorumd + mcp
 whereas it is involved in option 3 : corosync + cpg + cman + mcp
 but is seems that corosync is also used in both options .

 cman is just a corosync plugin.  think of cman being an alias for
 corosync + cman plugin


 I tried to configure option 3 as you've seen in my other email two days
 ago, and we
 only have a mini cluster.conf file , and no more corosync.conf (and it
 works once
 I start Pacemaker after cman ;-) )

 My question is now :
 when the option 4 will be available, we will come back to the
 corosync.conf file  ?

 yes

 as same as with option 2 and no more cluster.conf  ?

 right

 And to be completely clear on why my question :
 the temporary option 3 forces us to use a mini cluster.conf, and
 therefore
 only one heartbeat network (or two but with bonding).

 I'm pretty sure you can have redundant rings with cluster.conf, I just
 don't know the details.

 But if in the future we configure option 4, and come back to
 corosync.conf, we will be able to have again two networks rings in
 the corosync.conf, and so ... that sounds be much better for me.
 Excpet if quorumd is working with a mini cluster.conf like cman ?

 No. Just corosync.conf
 You can get a preview of how option 4 works here:


 https://www.dropbox.com/s/zd1mi6u1m7ac5t9/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf
 


 I need to finish it off and push to clusterlabs...


 Thanks for these precisions.
 Regards
 Alain
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-19 Thread Andreas Kurz
On 06/19/2012 04:00 AM, Martin Marji Cermak wrote:
 Hello guys,
 I have 3 questions if you please.
 
 I have a HA NFS cluster - Centos 6.2, pacemaker, corosync, two NFS nodes
 plus 1 quorum node, in semi Active-Active configuration.
 By semi, I mean that both NFS nodes are active and each of them is under
 normal circumstances exclusively responsible for one (out of two) Volume
 Group - using the ocf:heartbeat:LVM RA.
 Each LVM volume group lives on a dedicated multipath iscsi device, exported
 from a shared SAN.
 
 I'm exporting a NFSv3/v4 export (/srv/nfs/software_repos directory). I need
 to make it available for 2 separate /21 networks as read-only, and for 3
 different servers as read-write.
 I'm using the ocf:heartbeat:exportfs RA and it seems to me I have to use
 the ocf:heartbeat:exportfs RA 5 times.

If you want to use NFSv4 you also need to export the virtual nfs
file-system root with fsid=0.
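
For example, something along these lines -- a sketch only, the directory
and client range are assumptions and must match your layout:

primitive p_exportfs_root ocf:heartbeat:exportfs \
        params fsid="0" directory="/srv/nfs" options="rw,crossmnt" \
        clientspec="10.x.x.0/21" \
        op monitor interval="30s"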

In its current incarnation, the exportfs RA does not allow you to combine
several clients with the same export options in one primitive
configuration, even though the exportfs command itself would support it
... patches are welcome ;-)

 
 
 The configuration (only IP addresses changed) is here:
 http://pastebin.com/eHkgUv64
 
 
 1) is there a way how to export this directory 5 times without defining 5
 ocf:heartbeat:exportfs primitives? It's a lot of duplications...
 I search all the forums and I fear the ocf:heartbeat:exportfs simply
 supports only one host / network range. But maybe someone has been working
 on a patch?

see above ... you may be able to save a little duplication by using
id-refs ...

 
 
 
 2) while using the ocf:heartbeat:exportfs 5 times for the same directory,
 do I have to use the _same_ FSID (201 in my config) for all these 5
 primitives (as Im exporting the _same_ filesystem / directory)?
 I'm getting this warning when doing so
 
 WARNING: Resources
 p_exportfs_software_repos_ae1,p_exportfs_software_repos_ae2,p_exportfs_software_repos_buller,p_exportfs_software_repos_iap-mgmt,p_exportfs_software_repos_youyangs
 violate uniqueness for parameter fsid: 201
 Do you still want to commit?

It is only a warning and as you said, it is the same filesystem.

 
 
 
 3) wait_for_leasetime_on_stop - I believe this must be set to true when
 exporting NFSv4  with ocf:heartbeat:exportfs.
 http://www.linux-ha.org/doc/man-pages/re-ra-exportfs.html
 
 My 5 exportfs primitives reside in the same group:
 
 group g_nas02 p_lvm02 p_exportfs_software_repos_youyangs
 p_exportfs_software_repos_buller p_fs_software_repos
 p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2
 p_exportfs_software_repos_iap-mgmt p_ip02 \
 meta resource-stickiness=101
 
 
 Even though I have the /proc/fs/nfsd/nfsv4gracetime set to 10 seconds, a
 failover of the NFS group from one NFS node to the second node would take
 more than 50 seconds,
 as it will be waiting for each ocf:heartbeat:exportfs resource sleeping 10
 seconds 5 times.
 
 Is there any way of making them fail over / sleeping in parallel, instead
 of sequential?

use resource sets like:

order o_nas02 inf: p_lvm02 ( p_exportfs_software_repos_youyangs
 p_exportfs_software_repos_buller p_fs_software_repos
 p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2
 p_exportfs_software_repos_iap-mgmt ) p_ip02

colocation co_nas02 inf: p_lvm02 p_exportfs_software_repos_youyangs
 p_exportfs_software_repos_buller p_fs_software_repos
 p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2
 p_exportfs_software_repos_iap-mgmt p_ip02

this allows a parallel start and stop of all exports

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 I workarounded this by setting wait_for_leasetime_on_stop=true for only
 one of these (which I believe is safe and does the job it is expected to do
 - please correct me if I'm wrong).
 
 
 
 Thank you for your valuable comments.
 
 My Pacemaker configuration: http://pastebin.com/eHkgUv64
 
 
 [root@irvine ~]# facter | egrep 'lsbdistid|lsbdistrelease'
 lsbdistid = CentOS
 lsbdistrelease = 6.2
 
 [root@irvine ~]# rpm -qa | egrep 'pacemaker|corosync|agents'
 
 corosync-1.4.1-4.el6_2.2.x86_64
 pacemaker-cli-1.1.6-3.el6.x86_64
 pacemaker-libs-1.1.6-3.el6.x86_64
 corosynclib-1.4.1-4.el6_2.2.x86_64
 pacemaker-cluster-libs-1.1.6-3.el6.x86_64
 pacemaker-1.1.6-3.el6.x86_64
 fence-agents-3.1.5-10.el6_2.2.x86_64
 
 resource-agents-3.9.2-7.el6.x86_64
with /usr/lib/ocf/resource.d/heartbeat/exportfs updated by hand from:
 https://github.com/ClusterLabs/resource-agents/commits/master/heartbeat/exportfs
 
 Thank you very much
 Marji Cermak
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 






___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org

Re: [Linux-HA] ocf:heartbeat:exportfs multiple exports, fsid, wait_for_leasetime_on_stop

2012-06-19 Thread Andreas Kurz
On 06/19/2012 10:48 AM, Andreas Kurz wrote:
 On 06/19/2012 04:00 AM, Martin Marji Cermak wrote:
 Hello guys,
 I have 3 questions if you please.

 I have a HA NFS cluster - Centos 6.2, pacemaker, corosync, two NFS nodes
 plus 1 quorum node, in semi Active-Active configuration.
 By semi, I mean that both NFS nodes are active and each of them is under
 normal circumstances exclusively responsible for one (out of two) Volume
 Group - using the ocf:heartbeat:LVM RA.
 Each LVM volume group lives on a dedicated multipath iscsi device, exported
 from a shared SAN.

 I'm exporting a NFSv3/v4 export (/srv/nfs/software_repos directory). I need
 to make it available for 2 separate /21 networks as read-only, and for 3
 different servers as read-write.
 I'm using the ocf:heartbeat:exportfs RA and it seems to me I have to use
 the ocf:heartbeat:exportfs RA 5 times.
 
 If you want to use NFSv4 you also need to export the virtual nfs
 file-system root with fsid=0.
 
 In its current incarnation, the exportfs RA does not allow to summarize
 different clients with the same export options in one primitive
 configuration, though the exportfs command would support it ... patches
 are welcome ;-)
 


 The configuration (only IP addresses changed) is here:
 http://pastebin.com/eHkgUv64


 1) is there a way how to export this directory 5 times without defining 5
 ocf:heartbeat:exportfs primitives? It's a lot of duplications...
 I search all the forums and I fear the ocf:heartbeat:exportfs simply
 supports only one host / network range. But maybe someone has been working
 on a patch?
 
 see above ... you may be able to save a little duplication by using
 id-refs ...
 



 2) while using the ocf:heartbeat:exportfs 5 times for the same directory,
 do I have to use the _same_ FSID (201 in my config) for all these 5
 primitives (as Im exporting the _same_ filesystem / directory)?
 I'm getting this warning when doing so

 WARNING: Resources
 p_exportfs_software_repos_ae1,p_exportfs_software_repos_ae2,p_exportfs_software_repos_buller,p_exportfs_software_repos_iap-mgmt,p_exportfs_software_repos_youyangs
 violate uniqueness for parameter fsid: 201
 Do you still want to commit?
 
 It is only a warning and as you said, it is the same filesystem.
 



 3) wait_for_leasetime_on_stop - I believe this must be set to true when
 exporting NFSv4  with ocf:heartbeat:exportfs.
 http://www.linux-ha.org/doc/man-pages/re-ra-exportfs.html

 My 5 exportfs primitives reside in the same group:

 group g_nas02 p_lvm02 p_exportfs_software_repos_youyangs
 p_exportfs_software_repos_buller p_fs_software_repos
 p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2
 p_exportfs_software_repos_iap-mgmt p_ip02 \
 meta resource-stickiness=101


 Even though I have the /proc/fs/nfsd/nfsv4gracetime set to 10 seconds, a
 failover of the NFS group from one NFS node to the second node would take
 more than 50 seconds,
 as it will be waiting for each ocf:heartbeat:exportfs resource sleeping 10
 seconds 5 times.

 Is there any way of making them fail over / sleeping in parallel, instead
 of sequential?
 
 use resource sets like:
 

small correction of myself ;-) ... filesystem-mount has to be before the
exports of course:

 order o_nas02 inf: p_lvm02 ( p_exportfs_software_repos_youyangs
  p_exportfs_software_repos_buller p_fs_software_repos
  p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2
  p_exportfs_software_repos_iap-mgmt ) p_ip02

order o_nas02 inf: p_lvm02 p_fs_software_repos \
(p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller \
  p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 \
  p_exportfs_software_repos_iap-mgmt ) p_ip02

 
 colocation co_nas02 inf: p_lvm02 p_exportfs_software_repos_youyangs
  p_exportfs_software_repos_buller p_fs_software_repos
  p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2
  p_exportfs_software_repos_iap-mgmt p_ip02

colocation co_nas02 inf: p_lvm02 p_fs_software_repos \
p_exportfs_software_repos_youyangs p_exportfs_software_repos_buller \
p_exportfs_software_repos_ae1 p_exportfs_software_repos_ae2 \
p_exportfs_software_repos_iap-mgmt p_ip02

Regards,
Andreas

 
 this allows a parallel start and stop of all exports
 
 Regards,
 Andreas
 





___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)

2012-06-18 Thread Andreas Kurz
On 06/18/2012 03:04 PM, alain.mou...@bull.net wrote:
 Hi again,
 
 could you tell me the package which install the pacemaker plugin v1 and 
 which
 is the name of the binary or binaries or src ?

You only need to add:

service {
# Load the Pacemaker Cluster Resource Manager
ver:   1
name:  pacemaker
}

to your corosync.conf ... or create a file with this content in
/etc/corosync/service.d/.

Once you have started Corosync you need to start the pacemaker init
script, which starts the MCP ... and stop those services in reverse order.

The init script is part of the pacemaker package on RHEL 6.x
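
In other words, with the RHEL 6.x init scripts:

service corosync start && service pacemaker start
# ... and to shut down:
service pacemaker stop && service corosync stop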

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 Thanks a lot
 Alain
 
 
 
 From: Andrew Beekhof and...@beekhof.net
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Date: 16/06/2012 12:25
 Subject: Re: [Linux-HA] Pacemaker/corosync == Pacemaker/cman (on RH 6.2)
 Sent by: linux-ha-boun...@lists.linux-ha.org
 
 
 
 On Fri, Jun 15, 2012 at 10:06 PM,  alain.mou...@bull.net wrote:
 Hi Andrew

 you recall me in an old thread here that effectively cman was not 
 involved

 in option 4 : corosync + cpg + quorumd + mcp
 whereas it is involved in option 3 : corosync + cpg + cman + mcp
 but is seems that corosync is also used in both options .
 
 cman is just a corosync plugin.  think of cman being an alias for
 corosync + cman plugin
 

 I tried to configure option 3 as you've seen in my other email two days
 ago, and we
 only have a mini cluster.conf file , and no more corosync.conf (and it
 works once
 I start Pacemaker after cman ;-) )

 My question is now :
 when the option 4 will be available, we will come back to the
 corosync.conf file  ?
 
 yes
 
 as same as with option 2 and no more cluster.conf  ?
 
 right
 
 And to be completely clear on why my question :
 the temporary option 3 forces us to use a mini cluster.conf, and 
 therefore
 only one heartbeat network (or two but with bonding).
 
 I'm pretty sure you can have redundant rings with cluster.conf, I just
 don't know the details.
 
 But if in the future we configure option 4, and come back to
 corosync.conf, we will be able to have again two networks rings in
 the corosync.conf, and so ... that sounds be much better for me.
 Excpet if quorumd is working with a mini cluster.conf like cman ?
 
 No. Just corosync.conf
 You can get a preview of how option 4 works here:
   
 https://www.dropbox.com/s/zd1mi6u1m7ac5t9/Pacemaker-1.1-Clusters_from_Scratch-en-US.pdf
 
 
 I need to finish it off and push to clusterlabs...
 

 Thanks for these precisions.
 Regards
 Alain
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 







___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Upgrade pacemaker 1.0.9.1 to squeeze-backports?

2012-06-12 Thread Andreas Kurz
Hi,

On 06/12/2012 04:42 PM, Helmut Wollmersdorfer wrote:
 Hi,
 
 are there any known problems with an upgrade from stock debian squeeze  
 to squeeze backports, i.e.

No, it just works ;-)

 
 from
 
 ii  pacemaker   1.0.9.1+hg15626-1 HA cluster resource manager
 ii  heartbeat  1:3.0.3-2Subsystem for High- 
 Availability Linux
 
 to
 
 pacemaker (1.1.7-1~bpo60+1)
 heartbeat (1:3.0.5-2~bpo60+1)

... and cluster-glue, resource-agents

You really want to start thinking about migrating to Corosync, though
Heartbeat still works (up to now).

 
 
 As I have to do a hardware upgrade of both nodes of a 2-node Xen-DRBD- 
 cluster, should a upgrade pacemaker first with just
 
  apt.get ...
 
 or what is the recommended procedure (AFAIK pacemaker 1.0.9 does not  
 support 'maintenance-mode').

maintenance-mode should work fine in Pacemaker 1.0.9, do you have
problems enabling it? ... crm configure property maintenance-mode=true

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 TIA
 
 Helmut Wollmersdorfer
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 





___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] VirtualDomain stop error

2012-05-30 Thread Andreas Kurz
Hi Markus,

On 05/30/2012 01:45 PM, Markus Knaup wrote:
 Pawel Warowny warp at master.pl writes:
 

 On Fri, 1 Jul 2011 12:57:32 +0200
 Pawel Warowny warp at master.pl wrote:

 Hi

 Errors are always the same: cannot send monitor command
 '{execute:query-balloon}': Connection reset by peer

 Because no one answered, should I report a bug about VirtualDomain
 resource agent and what's the proper way to do it?

 Best regards
 
 
 Hi Pawel,
 
 I have the same problem with my setup (DRBD with a virtual machine with KVM
 running, Pacemaker and Corosync controlling the cluster). When I stop the vm,
 sometimes an error occurs and the vm will not be migrated to the other node. I
 have to clean the resource by hand.
 Did you find a solution?
 Best regars

Can you give us some more information?  what are the errors that
occur? ... grep your logs for VirtualDomain ... do you use the latest
version of the RA?

And what is the output of crm_mon -1fr after this error happened?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Markus
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Error while creating various disk volumes

2012-05-22 Thread Andreas Kurz
On 05/22/2012 02:12 PM, Net Warrior wrote:
 Hi there
 Reading the documentation I found that I can have multiple disks in the
 configuration, and I need to do so. This is my conf:
 
 resource myresource {
 syncer {
 rate 100M;
 }
  volume 0 {
 device/dev/drbd1;
 disk  /dev/rootvg/lvu02;
 meta-disk /dev/rootvg/drbdmetadata[0];
   }
 
  volume 1 {
 device/dev/drbd2;
 disk  /dev/rootvg/lvarch;
 meta-disk /dev/rootvg/drbdmetadata[0];
   }
 
   on node1 {
 address   x.x.x.x:7789;
   }
on node2 {
 address   x.x.x.x:7789;
   }
 
 }
 
 When creating the resource I get the following error
 drbd.d/myresource.res:7: Parse error: 'protocol | on | disk | net | syncer
 | startup | handlers | ignore-on | stacked-on-top-of' expected,
 but got 'volume' (TK 281)

this is a DRBD 8.4 feature

 
 Im using this version
 drbd83-8.3.8-1.el4_8
 **
 
 Any help on this?

Please read the DRBD users guide for version 8.3 and _not_ 8.4 ...
http://www.drbd.org/users-guide-8.3/
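
With 8.3 you would define two separate resources instead of one resource
with two volumes. A rough sketch, reusing the devices/disks from your mail
(ports, protocol and internal meta-data are assumptions, adjust as needed):

  resource r_u02 {
    protocol C;
    syncer { rate 100M; }
    on node1 {
      device    /dev/drbd1;
      disk      /dev/rootvg/lvu02;
      address   x.x.x.x:7789;
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd1;
      disk      /dev/rootvg/lvu02;
      address   x.x.x.x:7789;
      meta-disk internal;
    }
  }

  resource r_arch {
    protocol C;
    syncer { rate 100M; }
    on node1 {
      device    /dev/drbd2;
      disk      /dev/rootvg/lvarch;
      address   x.x.x.x:7790;   # every resource needs its own port
      meta-disk internal;
    }
    on node2 {
      device    /dev/drbd2;
      disk      /dev/rootvg/lvarch;
      address   x.x.x.x:7790;
      meta-disk internal;
    }
  }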

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 Thanks you very much for your time and support
 Regards
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems







signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-ha-dev] Rename parameter of conntrackd RA

2012-05-11 Thread Andreas Kurz
Hi all,

Dejan asked me to bring this up on the mailing list.

Please have a look at this proposed patch for conntrackd. It renames the
parameter "conntrackd" to "binary" and remaps "conntrackd" to
"binary" if users of the old RA version still use it.

Background: Internally the RA expects a parameter called "binary" to be
defined. So whatever value users specified in the "conntrackd" parameter
was ignored; a default value was used instead.

Please share your thoughts on changing a parameter name of an already
released resource agent.
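
For users of existing setups, in crm shell terms the change would look
roughly like this (resource id and binary path are just illustrative):

  # before the rename (old parameter name, which was silently ignored):
  primitive conntrackd ocf:heartbeat:conntrackd \
          params conntrackd="/usr/sbin/conntrackd"

  # after the rename:
  primitive conntrackd ocf:heartbeat:conntrackd \
          params binary="/usr/sbin/conntrackd"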

thx & regards,
Andreas

diff --git a/heartbeat/conntrackd b/heartbeat/conntrackd
index 7502f5a..3ee2f83 100755
--- a/heartbeat/conntrackd
+++ b/heartbeat/conntrackd
@@ -36,7 +36,10 @@
 
 OCF_RESKEY_binary_default=conntrackd
 OCF_RESKEY_config_default=/etc/conntrackd/conntrackd.conf
-: ${OCF_RESKEY_binary=${OCF_RESKEY_binary_default}}
+
+# For users of versions prior to 1.2:
+# Map renamed parameter conntrackd to binary if in use
+: ${OCF_RESKEY_binary=${OCF_RESKEY_conntrackd-${OCF_RESKEY_binary_default}}}
 : ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}
 
 meta_data() {
@@ -44,7 +47,7 @@ meta_data() {
 <?xml version="1.0"?>
 <!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
 <resource-agent name="conntrackd">
-<version>1.1</version>
+<version>1.2</version>
 
 <longdesc lang="en">
 Master/Slave OCF Resource Agent for conntrackd
@@ -53,7 +56,7 @@ Master/Slave OCF Resource Agent for conntrackd
 <shortdesc lang="en">This resource agent manages conntrackd</shortdesc>
 
 <parameters>
-<parameter name="conntrackd">
+<parameter name="binary">
 <longdesc lang="en">Name of the conntrackd executable.
 If conntrackd is installed and available in the default PATH, it is sufficient to configure the name of the binary
 For example "my-conntrackd-binary-version-0.9.14"
-- 
1.7.4.1




signature.asc
Description: OpenPGP digital signature
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Heartbeat Failover Configuration Question

2012-04-23 Thread Andreas Kurz
On 04/23/2012 01:47 PM, Net Warrior wrote:
 True, but even with the most expensive software like Veritas Cluster or
 Red Hat Cluster I can configure how I want to fail over the resources
 (auto or manual); that's why I'm curious how to accomplish the same
 here.

With the help of the meatware stonith plugin, a manual acknowledgement of
the failover process is required.
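
A rough sketch of such a setup (assuming the cluster-glue meatware plugin
is installed; node names are placeholders):

  primitive st-meatware stonith:meatware \
          params hostlist="node1 node2" \
          op monitor interval="3600s"

  # when fencing is required, an operator confirms it manually, e.g.:
  #   meatclient -c node1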

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Thanks for your time
 Best Regards
 
 2012/4/23, David Coulson da...@davidcoulson.net:
 Why even use heartbeat then - Just manually ifconfig the interface.

 On 4/23/12 7:39 AM, Net Warrior wrote:
 Hi Nikita

 This is the version
 heartbeat-3.0.0-0.7

 My aim is: if node1 is powered off or loses its ethernet connection,
 node2 won't fail over automatically; I want to do it manually, but I
 could not find how to accomplish that.


 Thanks for your time and support
 Best regards



 2012/4/23, Nikita Michalkomichalko.sys...@a-i-p.com:
 Hi, Net Warrior!


 What version of HA/Pacemaker do you use?
 Did you already RTFM - e.g.
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained
 - or:
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch


 HTH


 Nikita Michalko

 On Monday, 23 April 2012 02:23:20, Net Warrior wrote:
 Hi There

 I configured heartbeat to fail over an IP address. If I, for example,
 shut down one node, the other takes its IP address; so far so good. Now
 my doubt is whether there is a way to configure it not to make the
 failover automatically and have someone run the failover manually. Can
 you provide any configuration example, please? Is this stanza the one
 that does the magic?

 auto_failback on


 Thanks for your time and support
 Best regards
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] problem with pind

2012-04-13 Thread Andreas Kurz
On 04/12/2012 02:59 PM, Trujillo Carmona, Antonio wrote:
 
 I'm trying to configure a cluster and I have a problem with pingd.
 My config is:
 crm(live)configure# show
 node proxy-00
 node proxy-01
 primitive ip-segura ocf:heartbeat:IPaddr2 \
   params ip=10.104.16.123 nic=lan cidr_netmask=19 \
   op monitor interval=10 \
   meta target-role=Started
 primitive pingd ocf:pacemaker:pingd \

use ocf:pacemaker:ping

   params host_list=10.104.16.157 \

and you have to define a monitor operation.

Without any constraints that let the cluster react on connectivity changes
the ping resource is useless ... this may help:

http://www.hastexo.com/resources/hints-and-kinks/network-connectivity-check-pacemaker
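
A minimal sketch of what that usually looks like (clone and constraint
names are placeholders, thresholds are assumptions):

  primitive p_ping ocf:pacemaker:ping \
          params host_list="10.104.16.157" multiplier="100" dampen="10s" \
          op monitor interval="30s" timeout="60s"
  clone cl_ping p_ping
  location loc_ip_on_connected_node ip-segura \
          rule -inf: not_defined pingd or pingd lte 0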

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

   meta target-role=Started
 property $id=cib-bootstrap-options \
   dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
   cluster-infrastructure=openais \
   stonith-enabled=false \
   no-quorum-policy=ignore \
   expected-quorum-votes=2
 
 crm(live)# status
 
 Last updated: Thu Apr 12 14:54:21 2012
 Last change: Thu Apr 12 14:40:00 2012
 Stack: openais
 Current DC: proxy-00 - partition WITHOUT quorum
 Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 
 
 Online: [ proxy-00 ]
 OFFLINE: [ proxy-01 ]
 
  ip-segura(ocf::heartbeat:IPaddr2):   Started proxy-00
 
 Failed actions:
 pingd:0_monitor_0 (node=proxy-00, call=5, rc=2, status=complete):
 invalid parameter
 pingd_monitor_0 (node=proxy-00, call=8, rc=2, status=complete):
 invalid parameter
 
 crm(live)resource# start pingd 
 crm(live)resource# status
  ip-segura(ocf::heartbeat:IPaddr2) Started 
  pingd(ocf::pacemaker:pingd) Stopped 
 
 and in the system log I got:
 
 Apr 12 14:55:18 proxy-00 crm_resource: [27941]: ERROR: unpack_rsc_op:
 Hard error - pingd:0_last_failure_0 failed with rc=2: Preventing pingd:0
 from re-starting on proxy-00
 Apr 12 14:55:18 proxy-00 crm_resource: [27941]: ERROR: unpack_rsc_op:
 Hard error - pingd_last_failure_0 failed with rc=2: Preventing pingd
 from re-starting on proxy-00
 
 I have stoped node 2 in order to less problem
 
 I can't find any reference to this error.
 Can you help me, please?
 
 
 
 





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] crm configure primitive syntax, please HELP!!!

2012-03-30 Thread Andreas Kurz
On 03/30/2012 01:14 PM, Guglielmo Abbruzzese wrote:
 Hi everybody,
 I've got a question, probably very easy for someone; I can't find a proper 
 answer in the official doc.
 In detail, if I refer to the official pacemaker documentation related to the 
 command crm configure primitive I find what follows:
 
 usage: primitive rsc [class:[provider:]]type
 [params param=value [param=value...]]
 [meta attribute=value [attribute=value...]]
 [utilization attribute=value [attribute=value...]]
 [operations id_spec
 [op op_type [attribute=value...] ...]]
 
 If I launch the following command by CLI:
 
 crm configure primitive resource_vrt_ip ocf:heartbeat:IPaddr2  params 
 ip=192.168.15.73 nic=bond0 meta target-role=Stopped 
 multiple-active=stop_start migration-treshold=3 failure-timeout=0 
 operations  X  op name=monitor interval=180 timeout=60
 
 I get the following answer: ERROR: operations: only single $id or $id-ref 
 attribute is allowed
 
 Now, I can't get the meaning of the parameter id_spec mentioned in the 
 usage output. Could someone help me, or tell me what I shall write instead
 of X in order to get the following output in the cib?
 

you can omit that operations  X part
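
I.e. something along these lines should work (a sketch based on your
command; double-check the meta attribute values):

  crm configure primitive resource_vrt_ip ocf:heartbeat:IPaddr2 \
          params ip="192.168.15.73" nic="bond0" \
          meta target-role="Stopped" multiple-active="stop_start" \
               migration-threshold="3" failure-timeout="0" \
          op monitor interval="180" timeout="60s"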

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 <primitive class="ocf" id="resource_vrt_ip" provider="heartbeat" type="IPaddr2">
   <instance_attributes id="resource_vrt_ip-instance_attributes">
     <nvpair id="resource_vrt_ip-instance_attributes-ip" name="ip" value="192.168.15.73"/>
     <nvpair id="resource_vrt_ip-instance_attributes-nic" name="nic" value="bond0"/>
   </instance_attributes>
   <meta_attributes id="resource_vrt_ip-meta_attributes">
     <nvpair id="resource_vrt_ip-meta_attributes-target-role" name="target-role" value="Stopped"/>
     <nvpair id="resource_vrt_ip-meta_attributes-multiple-active" name="multiple-active" value="stop_start"/>
     <nvpair id="resource_vrt_ip-meta_attributes-migration-threshold" name="migration-threshold" value="3"/>
     <nvpair id="resource_vrt_ip-meta_attributes-failure-timeout" name="failure-timeout" value="0"/>
   </meta_attributes>
   <operations>
     <op id="resource_vrt_ip-startup" interval="180" name="monitor" timeout="60s"/>
   </operations>
 </primitive>
 
 P.S. I'd prefer not to load an XML file; I already tried it and it works, but 
 it is not the purpose of my help request.
 
 Thanks in advance
 
 
 Guglielmo Abbruzzese
 Project Leader 
 
 RESI Informatica S.p.A.
 Via Pontina Km 44,044 
 04011 Aprilia (LT) - Italy
 Tel:   +39 0692710 369
 Fax:  +39 0692710 208
 Email: g.abbruzz...@resi.it
 Web: www.resi.it
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] crm don't conect to cluster

2012-03-30 Thread Andreas Kurz
On 03/30/2012 08:27 AM, Trujillo Carmona, Antonio wrote:
 
 On Wed, 2012-03-28 at 14:09 +0200, Andreas Kurz wrote:
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 attached e-mail message
 - Forwarded message 
 Subject: 
 Date: Wed, 28 Mar 2012 14:46:15 +0200

 For a new cluster you should go with Corosync ... choose either
Heartbeat _or_ Corosync. Don't start both at the same time.

Can you share the corosync.conf that produced that errors you
 showed in
previous mails?

 OK, that's my idea, but since I can't make it run I began to test other
 things, even stupid ones.
 This is my corosync.conf:

Looks ok; if all nodes have the same file and the same (and correct)
authkey file, all should run fine.

Disable secauth and see if it still does not work ...
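
If you keep secauth enabled, regenerate the key on one node and copy the
very same file to the other node (host name below is a placeholder):

  corosync-keygen
  scp /etc/corosync/authkey proxy-01:/etc/corosync/authkey
  /etc/init.d/corosync restart    # on both nodes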

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 # cat /etc/corosync/corosync.conf
 # Please read the openais.conf.5 manual page
 
 totem {
   version: 2
 
   # How long before declaring a token lost (ms)
   token: 3000
 
   # How many token retransmits before forming a new configuration
   token_retransmits_before_loss_const: 10
 
   # How long to wait for join messages in the membership protocol (ms)
   join: 60
 
   # How long to wait for consensus to be achieved before starting a new
 round of membership configuration (ms)
   consensus: 3600
 
   # Turn off the virtual synchrony filter
   vsftype: none
 
   # Number of messages that may be sent by one processor on receipt of
 the token
   max_messages: 20
 
   # Limit generated nodeids to 31-bits (positive signed integers)
   clear_node_high_bit: yes
 
   # Disable encryption
   secauth: on
 
   # How many threads to use for encryption/decryption
   threads: 0
 
   # Optionally assign a fixed node id (integer)
   # nodeid: 1234
 
   # This specifies the mode of redundant ring, which may be none, active,
 or passive.
   rrp_mode: none
 
   interface {
   # The following values need to be set based on your environment 
   ringnumber: 0
   bindnetaddr: 10.104.0.0
   mcastaddr: 226.94.1.1
   mcastport: 5405
   }
 }
 
 amf {
   mode: disabled
 }
 
 service {
   # Load the Pacemaker Cluster Resource Manager
   ver:   0
   name:  pacemaker
   use_mgmtd: yes
 }
 
 aisexec {
 user:   root
 group:  root
 }
 
 corosync {
 user: root
 group: root
 }
 
 logging {
 fileline: off
 to_stderr: yes
 to_logfile: no
 to_syslog: yes
   syslog_facility: daemon
 debug: off
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
 }
 }
 
 
 Thank




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] crm don't conect to cluster

2012-03-28 Thread Andreas Kurz
On 03/28/2012 12:42 PM, Trujillo Carmona, Antonio wrote:
 
 Trying another way, I installed heartbeat and right now it works:
  # crm
 crm(live)# status
 
 Last updated: Wed Mar 28 12:33:47 2012
 Stack: Heartbeat
 Current DC: proxy-01 (d0034c01-f613-4d77-a390-05122f3a374c) - partition
 with quorum
 Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
 1 Nodes configured, unknown expected votes
 0 Resources configured.
 
 
 Online: [ proxy-01 ]
 
 crm(live)# config
 ERROR: syntax: config
 crm(live)# configure 
 INFO: building help index
 crm(live)configure# show 
 node $id=d0034c01-f613-4d77-a390-05122f3a374c proxy-01
 property $id=cib-bootstrap-options \
   dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
   cluster-infrastructure=Heartbeat
 
 But I see that now the cluster infrastructure is Heartbeat and not openais.
 So I figure I have a misconfiguration of openais.
 Really I don't know which one is better for me, openais or heartbeat.
 I only want to check a service (squid) and, if it fails, move a virtual
 IP from the production proxy to the standby one.

For a new cluster you should go with Corosync ... choose either
Heartbeat _or_ Corosync. Don't start both at the same time.

Can you share the corosync.conf that produced that errors you showed in
previous mails?
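
On Debian the switch usually boils down to something like this (a sketch
for Debian squeeze; double-check the steps for your release):

  /etc/init.d/heartbeat stop
  update-rc.d -f heartbeat remove
  # make sure START=yes is set in /etc/default/corosync, then:
  /etc/init.d/corosync start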

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] crm don't conect to cluster

2012-03-27 Thread Andreas Kurz
On 03/27/2012 02:58 PM, Trujillo Carmona, Antonio wrote:
 
 I'm just trying to configure a new cluster with pacemaker (based on debian
 stable).
 I followed the same instructions I used on another cluster but I can't make
 it work.
 Can you give me some path to check what the problem is?

You get these totem messages on a node that tries to join? Have you checked
that all nodes have the same corosync configuration ... especially the same
secauth options and (if in use) the same authkey?
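
A quick consistency check could look like this (the second node name is an
assumption):

  md5sum /etc/corosync/corosync.conf /etc/corosync/authkey
  ssh proxy-01 md5sum /etc/corosync/corosync.conf /etc/corosync/authkey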

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 When I check crm with crm status I always get:
 root@proxy-00:/etc/network# crm
 crm(live)# status
 
 Connection to cluster failed: connection failed
 In the log I got:
 
 Mar 27 14:21:19 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 86.
 Mar 27 14:21:19 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 113.
 Mar 27 14:21:19 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 18.
 Mar 27 14:21:19 proxy-00 crmd: [1501]: info: ais_dispatch: Membership
 16: quorum still lost
 Mar 27 14:21:20 proxy-00 cib: [1540]: info: write_cib_contents: Wrote
 version 0.0.0 of the CIB to disk (digest:
 ce550593fab3e1d7832aa06b6df0621d)
 Mar 27 14:21:20 proxy-00 cib: [1540]: info: retrieveCib: Reading cluster
 configuration from: /var/lib/heartbeat/crm/cib.yAAkTS
 (digest: /var/lib/heartbeat/crm/cib.IyTINP)
 Mar 27 14:21:20 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 7.
 Mar 27 14:21:21 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 91.
 Mar 27 14:21:21 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 21.
 Mar 27 14:21:21 proxy-00 frox[1362]: Listening on 0.0.0.0:8021
 Mar 27 14:21:21 proxy-00 frox[1362]: Dropped privileges
 Mar 27 14:21:21 proxy-00 corosync[1465]:   [TOTEM ] Type of received
 message is wrong...  ignoring 110.
 Mar 27 14:21:22 proxy-00 attrd: [1499]: ERROR: ais_dispatch: Receiving
 message body failed: (2) Library error: No such file or directory (2)
 Mar 27 14:21:22 proxy-00 attrd: [1499]: ERROR: ais_dispatch: AIS
 connection failed
 Mar 27 14:21:22 proxy-00 cib: [1497]: ERROR: ais_dispatch: Receiving
 message body failed: (2) Library error: Resource temporarily unavailable
 (11)
 Mar 27 14:21:22 proxy-00 attrd: [1499]: CRIT: attrd_ais_destroy: Lost
 connection to OpenAIS service!
 Mar 27 14:21:22 proxy-00 cib: [1497]: ERROR: ais_dispatch: AIS
 connection failed
 Mar 27 14:21:22 proxy-00 attrd: [1499]: info: main: Exiting...
 Mar 27 14:21:22 proxy-00 cib: [1497]: ERROR: cib_ais_destroy: AIS
 connection terminated
 Mar 27 14:21:22 proxy-00 crmd: [1501]: ERROR: ais_dispatch: Receiving
 message body failed: (2) Library error: Resource temporarily unavailable
 (11)
 Mar 27 14:21:22 proxy-00 crmd: [1501]: ERROR: ais_dispatch: AIS
 connection failed
 Mar 27 14:21:22 proxy-00 crmd: [1501]: ERROR: crm_ais_destroy: AIS
 connection terminated
 Mar 27 14:21:22 proxy-00 stonithd: [1496]: ERROR: ais_dispatch:
 Receiving message body failed: (2) Library error: No such file or
 directory (2)
 Mar 27 14:21:22 proxy-00 stonithd: [1496]: ERROR: ais_dispatch: AIS
 connection failed
 Mar 27 14:21:22 proxy-00 stonithd: [1496]: ERROR: AIS connection
 terminated
 Mar 27 14:21:25 proxy-00 kernel: [   18.780256] wan: no IPv6 routers
 present
 Mar 27 14:21:26 proxy-00 kernel: [   19.860256] lan: no IPv6 routers
 present
 
 
 Thank.
 





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Failed RAID leads to failed cluster

2012-03-20 Thread Andreas Kurz
On 03/20/2012 10:41 AM, Christoph Bartoschek wrote:
 Hi,
 
 we have a two-node nfs server setup.  Each node has a RAID 6 with 
 an Adaptec hardware controller. DRBD synchronizes the block device. On top 
 of it there is an nfs server.
 
 Today our RAID controller on the master failed to rebuild after one 
 hard disk had crashed, and the device /dev/sdb1 became unavailable 
 temporarily. I assume this is the case because of the following messages:
 
 Mar 20 04:01:58 laplace kernel: [1786373.892141] sd 0:0:1:0: [sdb] Very 
 big device. Trying to use READ CAPACITY(16).
 Mar 20 04:05:47 laplace kernel: [1786602.053040] block drbd1: peer( 
 Secondary - Unknown ) conn( Connected - TearDown ) pdsk( UpToDate - 
 Outdated )
 
 The cluster then detected failure and tried to promote the slave and to 
 demote the master. This failed because lvm timeout out to get stopped on 
 the master. I assume it tried to write something to the drbd device and 
 failed resulting in the timeout.
 
 
 So my question is. What are we doing wrong? And how can we prevent the 
 failure of the whole cluster in such a situation?

Please share your drbd and cluster configuration ... two lines from log
are not really enough to make suggestions based on facts.

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 Thanks
 Christoph
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Failed RAID leads to failed cluster

2012-03-20 Thread Andreas Kurz
On 03/20/2012 03:04 PM, Christoph Bartoschek wrote:
 On 20.03.2012 14:42, Andreas Kurz wrote:

 Please share your drbd and cluster configuration ... two lines from log
 are not really enough to make suggestions based on facts.
 
 I am sure that the raid controller either was blocking or unavailable 
 for some time:
 
 Mar 20 04:04:21 laplace kernel: [1786516.040017] aacraid: Host adapter 
 abort request (0,0,1,0)
 Mar 20 04:04:21 laplace kernel: [1786516.047925] aacraid: Host adapter 
 abort request (0,0,1,0)
 Mar 20 04:04:21 laplace kernel: [1786516.055909] aacraid: Host adapter 
 abort request (0,0,1,0)
 Mar 20 04:04:21 laplace kernel: [1786516.063740] aacraid: Host adapter 
 abort request (0,1,2,0)
 Mar 20 04:04:21 laplace kernel: [1786516.071576] aacraid: Host adapter 
 reset request. SCSI hang ?
 

Too bad ... no I/O error, so DRBD never gets to do a detach of the device ...

 
 Before this was recognized, a monitor event failed:
 
 Mar 20 04:04:05 laplace lrmd: [25177]: debug: perform_ra_op: resetting 
 scheduler class to SCHED_OTHER
 Mar 20 04:04:10 laplace lrmd: [1941]: WARN: p_lvm_nfs:monitor process 
 (PID 25087) timed out (try 1).  Killing with signal SIGTERM (15).
 Mar 20 04:04:10 laplace lrmd: [1941]: WARN: Managed p_lvm_nfs:monitor 
 process 25087 killed by signal 15 [SIGTERM - Termination (ANSI)].
 Mar 20 04:04:10 laplace lrmd: [1941]: WARN: operation monitor[25] on 
 ocf::LVM::p_lvm_nfs for client 1944, its parameters: 
 CRM_meta_name=[monitor] crm_feature_set=[3.0.1] volgrpname=[afs] 
 CRM_meta_timeout=[2] CRM_meta_interval=[3] : pid [25087] timed out
 
 
 Then stopping the LVM resource failed and the cluster broke apart.

Use stonith and the node would have been fenced.
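
For example, roughly like this (the agent choice and parameters depend
entirely on your hardware, e.g. IPMI, iLO, DRAC, so treat this purely as a
placeholder sketch):

  primitive st-laplace stonith:external/ipmi \
          params hostname="laplace" ipaddr="<bmc-ip>" userid="<user>" passwd="<secret>" \
          op monitor interval="3600s"
  location l-st-laplace st-laplace -inf: laplace
  property stonith-enabled="true"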

Best Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 
 The drbd.conf is:
 
 global {
  usage-count yes;
 }
 
 common {
syncer {
  rate 125M;
}
 }
 
 resource afs {
protocol C;
 
startup {
  wfc-timeout 0;
  degr-wfc-timeout  120;
}
disk {
  on-io-error detach;
  fencing resource-only;
}
handlers {
  fence-peer /usr/lib/drbd/crm-fence-peer.sh;
  after-resync-target /usr/lib/drbd/crm-unfence-peer.sh;
}
net {
}
on ries {
  device /dev/drbd1;
  disk   /dev/sdb1;
  address10.1.0.2:7788;
  meta-disk  internal;
}
on laplace {
  device /dev/drbd1;
  disk   /dev/sdb1;
  address10.1.0.3:7788;
  meta-disk  internal;
}
 }
 
 
 The crm configuration is:
 
 
 node laplace \
  attributes standby=on
 node ries \
  attributes standby=off
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
  params ip=192.168.143.228 cidr_netmask=24 \
  op monitor interval=30s
 primitive mail ocf:pacemaker:ClusterMon \
  op monitor interval=180 timeout=20 \
  params extra_options=--mail-to admin 
 htmlfile=/tmp/crm_mon.html \
  meta target-role=Started
 primitive p_drbd_nfs ocf:linbit:drbd \
  params drbd_resource=afs \
  op monitor interval=15 role=Master \
  op monitor interval=30 role=Slave
 primitive p_exportfs_afs ocf:heartbeat:exportfs \
  params fsid=1 directory=/srv/nfs/afs 
 options=rw,no_root_squash clientspec=192.168.143.0/255.255.255.0 
 wait_for_leasetime_on_stop=false \
  op monitor interval=30s
 primitive p_fs_afs ocf:heartbeat:Filesystem \
  params device=/dev/afs/afs directory=/srv/nfs/afs 
 fstype=ext4 \
  op monitor interval=10s
 primitive p_lsb_nfsserver lsb:nfs-kernel-server \
  op monitor interval=30s
 primitive p_lvm_nfs ocf:heartbeat:LVM \
  params volgrpname=afs \
  op monitor interval=30s
 group g_nfs p_lvm_nfs p_fs_afs p_exportfs_afs ClusterIP \
  meta target-role=Started
 ms ms_drbd_nfs p_drbd_nfs \
  meta master-max=1 master-node-max=1 clone-max=2 
 clone-node-max=1 notify=true target-role=Started
 clone cl_lsb_nfsserver p_lsb_nfsserver
 clone cl_mail mail
 location drbd-fence-by-handler-ms_drbd_nfs ms_drbd_nfs \
  rule $id=drbd-fence-by-handler-rule-ms_drbd_nfs 
 $role=Master -inf: #uname ne ries
 colocation c_nfs_on_drbd inf: g_nfs ms_drbd_nfs:Master
 order o_drbd_before_nfs inf: ms_drbd_nfs:promote g_nfs:start
 property $id=cib-bootstrap-options \
  dc-version=1.0.9-unknown \
  cluster-infrastructure=openais \
  expected-quorum-votes=3 \
  stonith-enabled=false \
  no-quorum-policy=ignore \
  last-lrm-refresh=1332235117
 rsc_defaults $id=rsc-options \
  resource-stickiness=200
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http

Re: [Linux-HA] clvm/dlm/gfs2 hangs if a node crashes

2012-03-16 Thread Andreas Kurz
On 03/15/2012 11:50 PM, William Seligman wrote:
 On 3/15/12 6:07 PM, William Seligman wrote:
 On 3/15/12 6:05 PM, William Seligman wrote:
 On 3/15/12 4:57 PM, emmanuel segura wrote:

 we can try to understand what happen when clvm hang

 edit the /etc/lvm/lvm.conf  and change level = 7 in the log session and
 uncomment this line

 file = /var/log/lvm2.log

 Here's the tail end of the file (the original is 1.6M). Because there are
 no timestamps in the log, it's hard for me to point you to the place where
 I crashed the other system. I think (though I'm not sure) that the crash
 happened after the last occurrence of

 cache/lvmcache.c:1484   Wiping internal VG cache

 Honestly, it looks like a wall of text to me. Does it suggest anything to 
 you?

 Maybe it would help if I included the link to the pastebin where I put the
 output: http://pastebin.com/8pgW3Muw
 
 Could the problem be with lvm+drbd?
 
 In lvm2.conf, I see this sequence of lines pre-crash:
 
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 filters/filter-composite.c:31   Using /dev/md0
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 label/label.c:186   /dev/md0: No label detected
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
 device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
 device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
 device/dev-io.c:588   Closed /dev/drbd0
 device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
 device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
 device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
 device/dev-io.c:588   Closed /dev/drbd0
 
 I interpret this: Look at /dev/md0, get some info, close; look at /dev/drbd0,
 get some info, close.
 
 Post-crash, I see:
 
 evice/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:271   /dev/md0: size is 1027968 sectors
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 device/dev-io.c:588   Closed /dev/md0
 filters/filter-composite.c:31   Using /dev/md0
 device/dev-io.c:535   Opened /dev/md0 RO O_DIRECT
 device/dev-io.c:137   /dev/md0: block size is 1024 bytes
 label/label.c:186   /dev/md0: No label detected
 device/dev-io.c:588   Closed /dev/md0
 device/dev-io.c:535   Opened /dev/drbd0 RO O_DIRECT
 device/dev-io.c:271   /dev/drbd0: size is 5611549368 sectors
 device/dev-io.c:137   /dev/drbd0: block size is 4096 bytes
 
 ... and then it hangs. Comparing the two, it looks like it can't close 
 /dev/drbd0.
 
 If I look at /proc/drbd when I crash one node, I see this:
 
 # cat /proc/drbd
 version: 8.3.12 (api:88/proto:86-96)
 GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
 r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C s-
 ns:764 nr:0 dw:0 dr:7049728 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b 
 oos:0

"s-" ... DRBD suspended I/O, most likely because of its
fencing policy. For valid dual-primary setups you have to use the
"resource-and-stonith" policy and a working fence-peer handler. In
this mode I/O is suspended until fencing of the peer was successful. The
question is why the peer does _not_ also suspend its I/O, because obviously
fencing was not successful.

So with a correct DRBD configuration one of your nodes should already
have been fenced because of connection loss between nodes (on drbd
replication link).

You can use e.g. that nice fencing script:

http://goo.gl/O4N8f
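
The relevant drbd.conf fragment would then look roughly like this (sketch
only; the fence-peer script shown here is the stock handler shipped with
DRBD, the one behind the link above may differ):

  disk {
    fencing resource-and-stonith;
  }
  handlers {
    fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }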

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 If I look at /proc/drbd if I bring down one node gracefully (crm node 
 standby),
 I get this:
 
 # cat /proc/drbd
 version: 8.3.12 (api:88/proto:86-96)
 GIT-hash: e2a8ef4656be026bbae540305fcb998a5991090f build by
 r...@hypatia-tb.nevis.columbia.edu, 2012-02-28 18:01:34
  0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/Outdated C r-
 ns:764 nr:40 dw:40 dr:7036496 al:0 bm:516 lo:0 pe:0 ua:0 ap:0 ep:1 
 wo:b
 oos:0
 
 Could it be that drbd can't respond to certain requests from lvm if the state 
 of
 the peer is DUnknown instead of Outdated?
 
 Il giorno 15 marzo 2012 20:50, William Seligman 
 selig...@nevis.columbia.edu
 ha scritto:

 On 3/15/12 12:55 PM, emmanuel segura wrote:

 I don't see any error and the answer for your question it's yes

 can you show me your /etc/cluster/cluster.conf and your crm configure
 show

Re: [Linux-HA] stonith/fence using external/libvirt on KVM

2012-02-24 Thread Andreas Kurz
On 02/23/2012 02:59 PM, Tom Hanstra wrote:
 Hmmm, this is something which I did not understand when starting to look 
 into this.  If this is the case, it would be nice if the web pages were 
 updated accordingly.

You mean linux-ha.org? ... yeah, that might be true. But looking at
clusterlabs.org makes it quite clear that corosync is the way to go for
new setups ... there are also some nice FAQs:

http://clusterlabs.org/wiki/FAQ

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com

 
 Tom
 
 On 02/22/2012 05:48 PM, Andreas Kurz wrote:
 Since heartbeat is not actively developed any more, corosync is the way
 to go for a future proof setup.

 



signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] stonith/fence using external/libvirt on KVM

2012-02-22 Thread Andreas Kurz
Hello Tom,

On 02/22/2012 08:01 PM, Tom Hanstra wrote:
 I'm new to the Linux-HA clustering, though I've had experience
 with RedHat's Cluster packages for several years. I'm trying to see how
 the open source software compares.
 
 So, I set up two KVM Virtual servers running RHEL6 and compiled and
 installed the Cluster Glue, Heartbeat, and Pacemaker software. I was
 able to get two nodes running, though there are some errors which I will
 need to track down.

Oh ... why did you build the complete stack manually? Pacemaker is a
technology preview in RHEL6 and it ships the latest version; in
combination with corosync instead of Heartbeat this works really fine.

 
  From my other cluster experience, I know that getting fencing/stonith
 set up properly is something necessary and I want to work on that even
 before I try to track down other problems further. Without the ability
 to kill off a node, odd things can happen. So, my focus right now is on
 finding a working stonith device for this setup.
 
 I got all of the pieces I think I need for the external/libvirt device,
 have fence_virtd running on the host box and I do get output on both
 host and clients from the fence_xvm command:
 
 1023$ fence_xvm -o list
 RH5_LIS0 25132742-8e3a-a1f2-a862-de3705ea8d8f on
 RH5_LIS1 2b4d4813-0107-6aec-a66f-2159ec95da4c on
 RH5_LIS2 fa6e2603-f7d6-34fa-dd03-4886cdf6e44b on
 RH5_LIS3 aafc6639-2d29-8bbe-4d62-38498f390563 on
 RH6_WITS7 51c15635-889f-7213-b0e2-e213f771e52a on
 RH6_WITS8 2ffdfebe-d49b-698b-b76c-a4abd8cbf42a on
 
 Where I am running into problems right now is translating the
 information I have from this command into the proper setup and syntax to
 set this as a stonith device and actually test killing off a node. The
 information given by this command gives the names of the virtual
 machines. But in my cluster setup, I have given these node names:
 
 lv7-eli = RH6_WITS7
 lv8-eli = RH6_WITS8

Try something like that for a single host setup:

primitive stonith_lv7-eli stonith:fence_virt \
params pcmk_host_check=static-list \
pcmk_host_list=lv7-eli \
port=RH6_WITS7 \
op monitor interval=600s

... and the same for the other node with adapted names. You should also
take care not to run a stonith resource on the node that can be
fenced by it ... like:

location l_stonith_lv7-eli stonith_lv7-eli -inf: lv7-eli
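
For completeness, the mirrored resource for the second node would then be
(same sketch, names adapted as described):

  primitive stonith_lv8-eli stonith:fence_virt \
          params pcmk_host_check="static-list" \
          pcmk_host_list="lv8-eli" \
          port="RH6_WITS8" \
          op monitor interval="600s"
  location l_stonith_lv8-eli stonith_lv8-eli -inf: lv8-eli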

 
 What is the proper stonith command that will actually kill off a node in
 such a KVM setup? And how does that translate into settings I would add
 to my ha.cf file?

Even if you continue to use Heartbeat ccm instead of corosync, there is
nothing to be added to ha.cf, all stonith resource configuration is done
in the cib.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/services/remote

 
 Thanks,
 Tom Hanstra
 t...@nd.edu
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] stonith/fence using external/libvirt on KVM

2012-02-22 Thread Andreas Kurz
Hello,

On 02/22/2012 11:37 PM, Tom Hanstra wrote:
 See my further information with TH below...
 
 On 02/22/2012 05:09 PM, Andreas Kurz wrote:
 Hello Tom,

 On 02/22/2012 08:01 PM, Tom Hanstra wrote:
 I'm new to the Linux-HA clustering, though I've had experience
 with RedHat's Cluster packages for several years. I'm trying to see how
 the open source software compares.

 So, I set up two KVM Virtual servers running RHEL6 and compiled and
 installed the Cluster Glue, Heartbeat, and Pacemaker software. I was
 able to get two nodes running, though there are some errors which I will
 need to track down.
 Oh ... why did you build the complete stack manually? Pacemaker is
 technology preview in RHEL6 and it ships latest version  in
 combination with corosync instead of Heartbeat this works really fine.

 TH Unfortunately, I'm limited to the educational version of RHEL6 
 which does not include any of the clustering software without additional 
 charges.  I just did a check on both corosync and pacemaker.  For 
 corosync, the packages show up but are inaccessible; for pacemaker, only 
 pacemaker-cts is available.  I'm not sure if this is sufficient but 
 doubt it.

I see ... well, Centos and Scientific Linux have all packages in their
repos ...

 
 But is corosync better than heartbeat?  Or am I getting into a religious 
 war by asking that?

Since heartbeat is not actively developed any more, corosync is the way
to go for a future proof setup.

 
 
From my other cluster experience, I know that getting fencing/stonith
 set up properly is something necessary and I want to work on that even
 before I try to track down other problems further. Without the ability
 to kill off a node, odd things can happen. So, my focus right now is on
 finding a working stonith device for this setup.

 I got all of the pieces I think I need for the external/libvirt device,
 have fence_virtd running on the host box and I do get output on both
 host and clients from the fence_xvm command:

 1023$ fence_xvm -o list
 RH5_LIS0 25132742-8e3a-a1f2-a862-de3705ea8d8f on
 RH5_LIS1 2b4d4813-0107-6aec-a66f-2159ec95da4c on
 RH5_LIS2 fa6e2603-f7d6-34fa-dd03-4886cdf6e44b on
 RH5_LIS3 aafc6639-2d29-8bbe-4d62-38498f390563 on
 RH6_WITS7 51c15635-889f-7213-b0e2-e213f771e52a on
 RH6_WITS8 2ffdfebe-d49b-698b-b76c-a4abd8cbf42a on

 Where I am running into problems right now is translating the
 information I have from this command into the proper setup and syntax to
 set this as a stonith device and actually test killing off a node. The
 information given by this command gives the names of the virtual
 machines. But in my cluster setup, I have given these node names:

 lv7-eli = RH6_WITS7
 lv8-eli = RH6_WITS8
 Try something like that for a single host setup:

 primitive stonith_lv7-eli stonith:fence_virt \
  params pcmk_host_check=static-list \
  pcmk_host_list=lv7-eli \
  port=RH6_WITS7 \
  op monitor interval=600s

 TH Bear with me a bit.  This is a crm configuration command, right.  
 Can you help me understand where the information gets stored when I 
 issue this command?  I was thinking it would go to a file somewhere, but 
 as you mention later, this information does not come from the ha.cf 
 file.  Where does it go?

The cib.xml file is stored in the /var/lib/heartbeat/crm directory and
propagated to all nodes ... don't manipulate it manually ... crm
configure show gives you the crm syntax version, which is much easier to
read.

 ... and the same for the other node with adopted names. You should also
 take care to run the stonith resources not on that node that can be
 fenced by it ... like:

 location l_stonith_lv7-eli stonith_lv7-eli -inf: lv7-eli

 TH I'm not clear on what you mean here.  This another configuration 
 command, but I don't understand what it is doing.  In my two node 
 cluster, each node should be able to fence off the other.  How does this 
 command help to accomplish that?

This is a location constraint that disallows the stonith resource
capable of fencing lv7-eli from running on node lv7-eli.

 
 What is the proper stonith command that will actually kill off a node in
 such a KVM setup? And how does that translate into settings I would add
 to my ha.cf file?
 Even if you continue to use Heartbeat ccm instead of corosync, there is
 nothing to be added to ha.cf, all stonith resource configuration is done
 in the cib.

 Regards,
 Andreas

 Thanks for your help,

You are welcome!

 Tom
 

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/services/custom-training




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Apache wont start on VIP

2012-02-06 Thread Andreas Kurz
Hello,

On 02/05/2012 01:57 AM, mike wrote:
 Hi all,
 
 I've got a very simple 2 node setup. It runs a few VIPs and uses 
 ldirectord to load balance. That part works perfectly.
 
 On the same 2 node cluster I have apache running and it fails back and 
 forth fine as long as ports.conf is set to listen on all ip's. I do have 
 a VIP - 192.168.2.2 that I want Apache to start up on. I have tested 
 apache manually on both nodes by editing ports.conf and setting it to 
 192.168.2.2 and starting apache from the command line - works fine. When 

Please share your config if you want authoritative answers ... without
any further information it looks like a missing order constraint between
apache and its IP.

You should find logging output from apache RA in your syslogs 
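
A sketch of what is probably missing, using the resource names from your
status output (constraint names are placeholders):

  colocation c_web_with_ip inf: WebSite ApacheIP
  order o_ip_before_web inf: ApacheIP WebSite

... and make sure the Listen directive matches the VIP that is started
before apache.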

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 I try to get Apache to start with HA it fails. Here is what my current 
 set up looks like:
 
 Last updated: Sat Feb  4 20:50:25 2012
 Stack: Heartbeat
 Current DC: firethorn (3125c95a-33d1-4923-a5c0-38b228f90ecf) - partition 
 with quorum
 Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
 2 Nodes configured, unknown expected votes
 3 Resources configured.
 
 
 Online: [ firethorn vanderbilt ]
 
   Resource Group: Web_Cluster_IP
   ApacheIP   (ocf::heartbeat:IPaddr2):   Started firethorn
   Resource Group: Web_Cluster
   WebSite(ocf::heartbeat:apache):Started firethorn
   ClusterIP  (ocf::heartbeat:IPaddr2):   Started firethorn
   ClusterIP2 (ocf::heartbeat:IPaddr2):   Started firethorn
   Resource Group: LVS_Cluster
   LdirectorIP(ocf::heartbeat:IPaddr2):   Started firethorn
   ldirectord (ocf::heartbeat:ldirectord):Started firethorn
 
 ApacheIP starts up first before apache does so I'm at a loss to 
 understand why HA has issues starting apache. The logs are not revealing 
 unfortunately. Initially I had the ApacheIP and apache in the same 
 resource group so I thought maybe if I break the IP out into its own 
 group then I'd know it comes up first and apache should follow. It 
 doesn't sadly.
 
 So am I missing something obvious here? Why does apache start from the 
 command line but not in HA unless ports.conf is set to listen on all 
 interfaces?
 
 Thanks
 -mike
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Status about ocfs2.pcmk ?

2012-02-03 Thread Andreas Kurz
Hello,

On 02/03/2012 09:29 AM, alain.mou...@bull.net wrote:
 Hi Andreas ,
 thanks for your response, but two questions :
 1/ why going with GFS2 ? because you know that ocfs2+pacemaker still does 
 not
 work fine on rhel ? or ... ? 

Because GFS2 is actively developed mostly by Redhat including the parts
needed to glue it to Pacemaker and there have been some threads on the
Pacemaker ML.

I know it works with SLES11 SP1 with the packages shipped in the HA
extension and also latest Debian/Ubuntu Packages should work.

 2/ you're right GFS2 is working much better with pacemaker than OCFS2, but 
 the problem
 is that GFS2 was about 10 times less efficient with regard to IO 
 benchmarks than OCFS2! 

Never compared them by myself, I try to avoid using cluster file
systems. What is your use case?

 Has this status changed since 2010? I don't think so, judging from all the
 messages on the mailing list ... but I'm not sure.

I have the same impression, yes.

Regards,
Andreas


 Thanks
 Alain
 
 
 
 From: Andreas Kurz andr...@hastexo.com
 To: linux-ha@lists.linux-ha.org
 Date: 02/02/2012 15:47
 Subject: Re: [Linux-HA] Status about ocfs2.pcmk ?
 Sent by: linux-ha-boun...@lists.linux-ha.org
 
 
 
 On 02/02/2012 02:54 PM, alain.mou...@bull.net wrote:
 Hi

 Just wondering if someone has succeeded in configuring a working HA 
 configuration with Pacemaker/corosync
 and OCFS2 file systems, meaning using ocfs2.pcmk, on RHEL6 mainly (and 
 eventually SLES11)?
 
 For RHEL6 I'd recommend you go with GFS2 and follow the Cluster from
 Scratch documentation ... I know OCFS2 on SLES11 SP1 is working fine in
 a Pacemaker cluster.
 
 Regards,
 Andreas
 

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Status about ocfs2.pcmk ?

2012-02-02 Thread Andreas Kurz
On 02/02/2012 02:54 PM, alain.mou...@bull.net wrote:
 Hi
 
 Just wondering if someone has succeeded in configuring a working HA 
 configuration with Pacemaker/corosync
 and OCFS2 file systems, meaning using ocfs2.pcmk, on RHEL6 mainly (and 
 eventually SLES11)?

For RHEL6 I'd recommend you go with GFS2 and follow the Cluster from
Scratch documentation ... I know OCFS2 on SLES11 SP1 is working fine in
a Pacemaker cluster.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 (I tried at the end of 2010 but gave up after a few weeks because it was 
 not working at all)
 
 Thanks if someone can give a status?
 Regards
 Alain Moullé
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together

2012-01-13 Thread Andreas Kurz
Hello,

On 01/13/2012 08:39 AM, Niclas Müller wrote:
 Is it necessary to put services like drbd, apache or mysql into 
 pacemaker as a resource ?
 It worked without that, but is it better to add this as a service ?

If you want them to be highly available, meaning monitored and
automatically started/stopped when needed ... then yes, add them as
services.

Pacemaker in combination with STONITH can also help you in split-brain
scenarios.
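
For apache and mysql that could look roughly like this (config file paths
are Debian-style assumptions, adjust them to your installation):

  primitive p_apache ocf:heartbeat:apache \
          params configfile="/etc/apache2/apache2.conf" \
          op monitor interval="30s"
  primitive p_mysql ocf:heartbeat:mysql \
          params config="/etc/mysql/my.cnf" \
          op monitor interval="30s" timeout="30s"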

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 On 01/13/2012 01:59 AM, Andreas Kurz wrote:
 Hello,

 On 01/12/2012 11:29 PM, Niclas Müller wrote:

 That the VirtualIP isn't shown by 'ifconfig -a' is realy nice, because i
 made my failed search on this because of this howto :

 http://www.howtoforge.com/high_availability_loadbalanced_apache_cluster
  
 You follow a howto from the year 2006? ... anyway, ifconfig would show
 the IP because you used IPaddr and not IPaddr2 RA 





 Im going to understand now all. I've configurated a resource for
 failover-ip from your link, but get this error page. Pacemaker cannot
 start resource failover-ip
  
 of course you already have an interface up in this network? The resource
 agent only adds secondary addresses 

 And you should find most logging information on your DC - node2 in
 /var/log/{syslog,daemon.log)

 Regards,
 Andreas




 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together

2012-01-13 Thread Andreas Kurz
On 01/13/2012 10:34 AM, Niclas Müller wrote:
 I've got it so that apache and failover-ip run on the cluster. I'm hanging 
 on the problem that heartbeat starts apache on node1 and failover-ip on 
 node2 if both of them are running. Only if one node is offline are both 
 resources running on the online node.
 
 How can I tell pacemaker to run these two resources on one node only, 
 every time?

you need constraints ... don't stop reading documentation ;-)
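
The simplest form, assuming your primitives are really called apache and
failover-ip, is a group (a group implies ordering and colocation):

  group g_web failover-ip apache

... or, equivalently, an explicit colocation plus an order constraint.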

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 
 
 On 01/13/2012 08:39 AM, Niclas Müller wrote:
 Is it necessary to put services like drbd, apache or mysql into
 pacemaker as a resource ?
 It worked without that, but is it better to add this as a service ?


 On 01/13/2012 01:59 AM, Andreas Kurz wrote:

 Hello,

 On 01/12/2012 11:29 PM, Niclas Müller wrote:

  
 That the VirtualIP isn't shown by 'ifconfig -a' is realy nice, because i
 made my failed search on this because of this howto :

 http://www.howtoforge.com/high_availability_loadbalanced_apache_cluster


 You follow a howto from the year 2006? ... anyway, ifconfig would show
 the IP because you used IPaddr and not IPaddr2 RA 


  


 Im going to understand now all. I've configurated a resource for
 failover-ip from your link, but get this error page. Pacemaker cannot
 start resource failover-ip


 of course you already have an interface up in this network? The resource
 agent only adds secondary addresses 

 And you should find most logging information on your DC - node2 in
 /var/log/{syslog,daemon.log)

 Regards,
 Andreas




 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
  
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together

2012-01-12 Thread Andreas Kurz
On 01/12/2012 10:22 PM, Niclas Müller wrote:
 I'm currently setting up a Linux HA cluster with apache and MySQL.
 I've created three VMs with KVM virtualization: one NetworkManager as DNS
 and DHCP server, and two others as cluster nodes. All VMs are Debian
 Squeeze minimal installations. On the nodes I've installed the packages
 heartbeat and pacemaker. The configuration of heartbeat seems to be
 correct because in the syslog there are no errors and I can read that
 the nodes have contact. My first impression of the software packages is
 that heartbeat is only for checking the availability of the nodes, and
 Pacemaker is for the services which are to be managed. I cannot work out
 how heartbeat / pacemaker (??) uses the virtual IP. On both interfaces
 there is no eth0:0 with the configured IP address. Is the virtual IP used
 by the primary node only when a service is configured? Does anybody have
 a good howto for setting up an apache and mysql cluster with heartbeat
 and pacemaker?

yes, Heartbeat is the CCM (cluster consensus and membership) layer and
Pacemaker relies on it to get valid information about node health and
uses it to transfer messages/updates to the other nodes.

use "crm_mon -1fr" to see the current state of the cluster resources
you already configured in Pacemaker.

if you use IPaddr2 RA you have to use ip addr show to see the virtual IPs.

there are a lot of howtos available, maybe you want to start at:

http://www.clusterlabs.org/wiki/Documentation#Examples


Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 Thank's Niclas
 
 
 
 P.S. : A user have send me a mail that with the mail client i send the 
 question before made a mistake with my FORM Field, hopefully now away!
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Single Point of Failure

2012-01-12 Thread Andreas Kurz
On 01/13/2012 12:22 AM, Paul O'Rorke wrote:
 hmmm - it looks like I may have to re-evaluate this.
 
 Geographic redundancy is the point of this exercise; our office is in a
 location that has a less than ideal history for power reliability.  We are
 a small software company and rely on email for online sales and product
 delivery, so our solution - whatever it may be - must allow for one location
 to completely lose power and still deliver client emails.

as Dimitri says ... you really want to have a look at Google apps for
business ...

Regards,
Andreas

 
 Mail is a very complex subject and I must confess that the excellent
 suggestions made here may be a little more than I was prepared to dive into.
 
 Given that this is a HA-Linux list, and that if I understand this correctly
 it is not really designed for multi-site clusters, can anyone suggest a
 more suitable technology? (the server is running CentOS/Exim)
 
 Or perhaps I should be doing the grunt work and trying out some of the
 above suggestions...
 
 I do appreciate the excellent feedback to date!
 
 thanks
 
 On Thu, Jan 12, 2012 at 1:53 PM, Arnold Krille arn...@arnoldarts.de wrote:
 
 On Thursday 12 January 2012 22:14:41 Jakob Curdes wrote:
 Miles Fidelman wrote:
 - you can set up a 2ndry server (give it an MX record with lower
 priority than the primary server) - it will receive mail when the
 primary goes down; and you can set up the mail config to forward stuff
 automatically to the primary server when it comes back up -- people
 won't be able to get to their mail until the primary comes back up, but
 mail will get accepted and will eventually get delivered

 Just one additional note: in such a setup, you should not assume that
 the secondary server only receives mail when the first one is down from
 your side of view.
 A client somewhere might have a different connectivity view and might
 deliver mail to your secondary MX at any time. It is well-known that
 spammer systems even try to deliver to the secondary in the hope that
 protection there is lower. So, if you have a secondary, you must arrange
 for mail delivered to that server to be passed on to the primary or a
 separate backend server. And you need to protect it exactly as good as
 your primary against virus, spam, and DOS attacks.

 So: If you go through the hazzles to set up a second receiving host with
 the
 same quality and administration requirements as the first one, you will
 also
 want to reflect that by giving it an equally high score in the mx field.
 That
 way both servers will be used equally and you get load-balancing where you
 originally meant to buy hot-standby:-)

 Another comment from here: Email is such an old protocol that the immunity
 to
 network errors was built in. If a sending host can't reach the receiver, it
 will try again after some time. And then again and again until a timeout is
 reached. And that timeout is not 2-4 seconds like with many tcp-based
 protocols but 4 days giving the admins the chance on monday to fix the
 mailserver that crashed on friday evening. Of course, if you rely on fast
 mail for your business, the price of redundant smtp and redundant pop3/imap
 servers might pay off.
 For redundant pop3/imap the cyrus project (and probably the other too)
 seem to
 have a special daemon to sync mails and mail-actions across servers. Add a
 redundant master-slave replicating mysql (or postgres) for the account
 database or even ldap and you should get something that even scales beyond
 2
 machine. Completely off-topic for this list as I haven't thrown in any
 heartbeat, pacemaker, corosync or drbd at this point.

 Have fun,

 Arnold

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

 
 
 

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Resended : Understanding how heartbeat and pacemaker work together

2012-01-12 Thread Andreas Kurz
Hello,

On 01/12/2012 11:29 PM, Niclas Müller wrote:
 That the VirtualIP isn't shown by 'ifconfig -a' is really nice, because I 
 based my failed search for it on this howto:
 
 http://www.howtoforge.com/high_availability_loadbalanced_apache_cluster

You follow a howto from the year 2006? ... anyway, ifconfig would show
the IP because you used IPaddr and not IPaddr2 RA 

 
 
 
 
 I'm beginning to understand it all now. I've configured a resource for the
 failover IP from your link, but get the error below: Pacemaker cannot
 start the resource failover-ip.

of course you already have an interface up in this network? The resource
agent only adds secondary addresses 
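A quick way to check both points, assuming eth0 is the interface that
should carry the 192.168.0.x network:

  ip addr show eth0                    # the base address must already be configured here
  ip addr show | grep 192.168.0.100    # after a successful start the VIP shows up
                                       # (IPaddr adds an eth0:0 alias; IPaddr2 adds a
                                       #  secondary address that ifconfig will not list)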

And you should find most logging information on your DC (node2) in
/var/log/{syslog,daemon.log}

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 
 
 
 root@node1:/etc/ha.d# crm_mon -1fr
 
 Last updated: Thu Jan 12 17:25:43 2012
 Stack: Heartbeat
 Current DC: node2 (40dbaed1-9618-41f0-acbe-c3f0f6334cce) - partition 
 with quorum
 Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
 2 Nodes configured, unknown expected votes
 1 Resources configured.
 
 
 Online: [ node1 node2 ]
 
 Full list of resources:
 
   failover-ip(ocf::heartbeat:IPaddr):Stopped
 
 Migration summary:
 * Node node1:
 failover-ip: migration-threshold=100 fail-count=100
 * Node node2:
 failover-ip: migration-threshold=100 fail-count=100
 
 Failed actions:
  failover-ip_start_0 (node=node1, call=4, rc=1, status=complete): 
 unknown error
  failover-ip_start_0 (node=node2, call=4, rc=1, status=complete): 
 unknown error
 
 
 
 
 
 node $id=40dbaed1-9618-41f0-acbe-c3f0f6334cce node2
 node $id=6c62b04f-0d3a-4bc5-a084-ffba618a8e87 node1
 primitive failover-ip ocf:heartbeat:IPaddr \
  params ip=192.168.0.100 \
  op monitor interval=10s \
  meta target-role=Started
 property $id=cib-bootstrap-options \
  dc-version=1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b \
  cluster-infrastructure=Heartbeat \
  stonith-enabled=false
 
 
 
 
 
 
 Thanks a lot, the switch in my brain finally clicked and now it seems clear
 how this works together. Syslog doesn't show any errors about the
 failover-ip resource. Ideas?
 
 
 
 
 
 On 01/12/2012 10:55 PM, Andreas Kurz wrote:
 On 01/12/2012 10:22 PM, Niclas Müller wrote:

 I'm currently setting up a Linux HA cluster with Apache and MySQL.
 I've created three VMs with KVM virtualization: one as a network manager
 (DNS and DHCP server), and two others as cluster nodes. All VMs are Debian
 Squeeze minimal installations. On the nodes I've installed the packages
 heartbeat and pacemaker. The heartbeat configuration seems correct,
 because there are no errors in syslog and I can see that the nodes have
 contact. My first impression of the software packages is that heartbeat
 only checks the availability of the nodes, while Pacemaker is responsible
 for the services that are to be managed. I cannot figure out how
 heartbeat / pacemaker (??) uses the virtual IP. On both nodes there is no
 eth0:0 with the configured IP address. Is the virtual IP only used by the
 primary node once a service is configured? Does anybody have a good
 howto for setting up an Apache and MySQL cluster with heartbeat and
 pacemaker?
  
 yes, Heartbeat is the CCM (cluster consensus and membership) layer and
 Pacemaker relies on it to get valid information about node health and
 uses it to transfer messages/updates to the other nodes.

 use crm_mon -1fr to see the current state of the cluster resources
 you have already configured in Pacemaker.

 if you use IPaddr2 RA you have to use ip addr show to see the virtual IPs.

 there are a lot of howtos available, maybe you want to start at:

 http://www.clusterlabs.org/wiki/Documentation#Examples


 Regards,
 Andreas




 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] if promote runs into timeout

2012-01-10 Thread Andreas Kurz
Hello,

On 01/06/2012 06:14 PM, erkan yanar wrote:
 
 Moin,
 
 I'm having the issue that promoting a master can run into the promote timeout.
 After that, the resource is stopped and started as a slave.
 
 In my example it is a MySQL resource, where the promote waits for any
 replication lag to be applied. This can take a very long time.

If the resource is not ready to be promoted it should have no promotion
score ... is this the mysql RA shipped with the resource-agents package
or a home-grown RA?
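For reference, resource agents usually publish that promotion preference
via crm_master; a rough sketch of what a lag-aware agent could do in its
monitor action (get_replication_lag and MAX_LAG are placeholders, this is
not the stock mysql RA):

  lag=$(get_replication_lag)
  if [ "$lag" -le "$MAX_LAG" ]; then
      crm_master -Q -l reboot -v 100   # eligible for promotion
  else
      crm_master -Q -l reboot -D       # no promotion score while lagging
  fi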

 
 There are some thoughts on that issue:
 1. Dynamically increase the timeout with cibadmin. I haven't tested that yet.
 Would this work?
 2. on-fail=ignore
 With ignore, the resource is not restarted. But I don't like that approach.
 
 Is there an intelligent approach to dynamically change the timeout while 
 promoting?
 Or is there a better approach anyway?

Even if it were OK to promote such an instance, why not increase
the promote timeout to a fixed, safe value?
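For example, in crm shell syntax the promote timeout is just another
operation on the resource (names and values are placeholders):

  primitive p_mysql ocf:heartbeat:mysql \
      params ... \
      op promote interval="0" timeout="1800" \
      op monitor interval="30" role="Master" \
      op monitor interval="60" role="Slave"

An existing resource can be adjusted with crm configure edit.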

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 Regards
 Erkan
 
 
 
 
 






signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] DRBD Split brain question

2011-12-20 Thread Andreas Kurz
Hello,

On 12/20/2011 02:47 PM, Ulrich Windl wrote:
 Hi!
 
 I have a dual-primary DRBD that is not working well: it was working, then I
 shut it down and restarted it. DRBD complained about split brain and fenced
 the other node. When coming up, the other node fenced this node. IMHO
 neither node should have fenced the other.
 

no config from drbd, no cluster config, partial/filtered logs ...
fragments ... you have _all_ information and can't find the problem ...
sorry, but I can't see how anyone can help you based on that information.

I personally think it is part of the free community support deal to
share as much information as possible if one wants help for free.

Regards,
Andreas

-- 
Need help with Pacemaker or DRBD?
http://www.hastexo.com/now

 Here are the logs from both nodes, restricted to DRBD:
 
 Dec 20 14:22:01 h06 kernel: [339936.743323] block drbd0: Starting worker 
 thread (from cqueue [13353])
 Dec 20 14:22:01 h06 kernel: [339936.743452] block drbd0: disk( Diskless - 
 Attaching )
 Dec 20 14:22:01 h06 kernel: [339936.767174] block drbd0: Found 4 transactions 
 (6 active extents) in activity log.
 Dec 20 14:22:01 h06 kernel: [339936.767178] block drbd0: Method to ensure 
 write ordering: barrier
 Dec 20 14:22:01 h06 kernel: [339936.767185] block drbd0: drbd_bm_resize 
 called with capacity == 1048472
 Dec 20 14:22:01 h06 kernel: [339936.767194] block drbd0: resync bitmap: 
 bits=131059 words=2048 pages=4
 Dec 20 14:22:01 h06 kernel: [339936.767197] block drbd0: size = 512 MB 
 (524236 KB)
 Dec 20 14:22:01 h06 kernel: [339936.773015] block drbd0: bitmap READ of 4 
 pages took 2 jiffies
 Dec 20 14:22:01 h06 kernel: [339936.773032] block drbd0: recounting of set 
 bits took additional 0 jiffies
 Dec 20 14:22:01 h06 kernel: [339936.773035] block drbd0: 0 KB (0 bits) marked 
 out-of-sync by on disk bit-map.
 Dec 20 14:22:01 h06 kernel: [339936.773041] block drbd0: disk( Attaching - 
 UpToDate )
 Dec 20 14:22:01 h06 kernel: [339936.773045] block drbd0: attached to UUIDs 
 8344B9D0C389D2DC::902F198E803AB8E3:902E198E803AB8E3
 Dec 20 14:22:01 h06 kernel: [339936.795343] block drbd0: conn( StandAlone - 
 Unconnected )
 Dec 20 14:22:01 h06 kernel: [339936.795395] block drbd0: Starting receiver 
 thread (from drbd0_worker [10322])
 Dec 20 14:22:01 h06 kernel: [339936.795452] block drbd0: receiver (re)started
 Dec 20 14:22:01 h06 kernel: [339936.795458] block drbd0: conn( Unconnected - 
 WFConnection )
 Dec 20 14:22:02 h06 kernel: [339937.490329] block drbd0: role( Secondary - 
 Primary )
 Dec 20 14:22:02 h06 kernel: [339937.490583] block drbd0: new current UUID 
 B95131C56A7C2935:8344B9D0C389D2DC:902F198E803AB8E3:902E198E803AB8E3
 Dec 20 14:22:02 h06 multipathd: drbd0: update path write_protect to '0' 
 (uevent)
 Dec 20 14:22:02 h06 kernel: [339937.537270] block drbd0: Handshake 
 successful: Agreed network protocol version 96
 Dec 20 14:22:02 h06 kernel: [339937.537278] block drbd0: conn( WFConnection 
 - WFReportParams )
 Dec 20 14:22:02 h06 kernel: [339937.537335] block drbd0: Starting asender 
 thread (from drbd0_receiver [10344])
 Dec 20 14:22:02 h06 kernel: [339937.537725] block drbd0: data-integrity-alg: 
 not-used
 Dec 20 14:22:02 h06 kernel: [339937.543391] block drbd0: drbd_sync_handshake:
 Dec 20 14:22:02 h06 kernel: [339937.543394] block drbd0: self 
 B95131C56A7C2935:8344B9D0C389D2DC:902F198E803AB8E3:902E198E803AB8E3 bits:0 
 flags:
 0
 Dec 20 14:22:02 h06 kernel: [339937.543397] block drbd0: peer 
 3778E40F06BD4779:8344B9D0C389D2DC:902F198E803AB8E2:902E198E803AB8E3 bits:0 
 flags:
 0
 Dec 20 14:22:02 h06 kernel: [339937.543399] block drbd0: uuid_compare()=100 
 by rule 90
 Dec 20 14:22:02 h06 kernel: [339937.543403] block drbd0: helper command: 
 /sbin/drbdadm initial-split-brain minor-0
 Dec 20 14:22:02 h06 kernel: [339937.546011] block drbd0: helper command: 
 /sbin/drbdadm initial-split-brain minor-0 exit code 0 (0x0)
 Dec 20 14:22:02 h06 kernel: [339937.546015] block drbd0: Split-Brain detected 
 but unresolved, dropping connection!
 Dec 20 14:22:02 h06 kernel: [339937.546018] block drbd0: helper command: 
 /sbin/drbdadm split-brain minor-0
 Dec 20 14:22:02 h06 kernel: [339937.551050] block drbd0: meta connection shut 
 down by peer.
 Dec 20 14:22:02 h06 kernel: [339937.551056] block drbd0: conn( WFReportParams 
 - NetworkFailure )
 Dec 20 14:22:02 h06 kernel: [339937.551065] block drbd0: asender terminated
 Dec 20 14:22:02 h06 kernel: [339937.551067] block drbd0: Terminating asender 
 thread
 Dec 20 14:22:02 h06 kernel: [339937.586136] block drbd0: helper command: 
 /sbin/drbdadm split-brain minor-0 exit code 0 (0x0)
 Dec 20 14:22:02 h06 kernel: [339937.586146] block drbd0: conn( NetworkFailure 
 - Disconnecting )
 Dec 20 14:22:02 h06 kernel: [339937.586152] block drbd0: error receiving 
 ReportState, l: 4!
 Dec 20 14:22:02 h06 kernel: [339937.586211] block drbd0: Connection closed
 Dec 20 14:22:02 h06 kernel: [339937.586217] block drbd0: conn( Disconnecting 
 - 

Re: [Linux-HA] Antw: Re: OCFS on top of dual-primary DRBD in SLES11 SP1

2011-12-19 Thread Andreas Kurz
On 12/19/2011 09:15 AM, Ulrich Windl wrote:
 Andreas Kurz andr...@hastexo.com schrieb am 16.12.2011 um 14:01 in 
 Nachricht
 4eeb412e.9010...@hastexo.com:
 Hello Ulrich,

 On 12/16/2011 01:31 PM, Ulrich Windl wrote:
 Hi!

 I have some trouble with OCFS on top of DRBD that seems to be
 timing-related:
 OCFS is working on the DRBD device when DRBD itself wants to change
 something, it seems:

 can we see your cib and your full drbd configuration please ...
 
 It's somewhat complex, and I may not show you everything, sorry for that.

no problem ... you asked for help on a public mailing-list ...

 


 ...
 Dec 16 11:39:58 h06 kernel: [  122.426174] block drbd0: role( Secondary - 
 Primary )
 Dec 16 11:39:58 h06 multipathd: drbd0: update path write_protect to '0' 
 (uevent)
 Dec 16 11:40:29 h06 ocfs2_controld: start_mount: uuid 
 FD32E504527742CEA7DA6DB272D5D7B2, device /dev/drbd_r0, service ocfs2
 ...
 Dec 16 11:40:29 h06 kernel: [  152.837615] block drbd0: peer( Secondary - 
 Primary )
 Dec 16 11:40:29 h06 ocfs2_hb_ctl[19177]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -P 
 -d /dev/drbd_r0
 Dec 16 11:43:50 h06 kernel: [  354.559240] block drbd0: State change 
 failed: Device is held open by someone
 Dec 16 11:43:50 h06 kernel: [  354.559244] block drbd0:   state = { 
 cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r- }
 Dec 16 11:43:50 h06 kernel: [  354.559246] block drbd0:  wanted = { 
 cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r- }
 Dec 16 11:43:50 h06 drbd[28754]: [28786]: ERROR: r0: Called drbdadm -c 
 /etc/drbd.conf secondary r0
 Dec 16 11:43:50 h06 drbd[28754]: [28789]: ERROR: r0: Exit code 11

 A little bit later DRBD did it's own fencing (the machine rebooted)

 do you have logs to confirm this?
 
 Naturally not, as the commands "echo b > /proc/sysrq-trigger ; reboot -f"
 don't actually write nice log messages.

All those nice drbd notify scripts do send mail, at least to the local root
account. Additionally they log via syslog, as DRBD itself does when
executing the handler ... so you have a good chance of getting some
information on whether DRBD triggered that reboot ... at least if you are
doing remote syslogging.
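A minimal remote-syslog sketch, assuming rsyslog on both nodes and a
reachable log host (name and port are placeholders):

  # /etc/rsyslog.d/remote.conf on both cluster nodes
  *.*  @@loghost.example.com:514    # @@ = TCP, a single @ = UDP
  # then: service rsyslog restart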

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Regards,
 Ulrich
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] OCFS on top of dual-primary DRBD in SLES11 SP1

2011-12-16 Thread Andreas Kurz
Hello Ulrich,

On 12/16/2011 01:31 PM, Ulrich Windl wrote:
 Hi!
 
 I have some trouble with OCFS on top of DRBD that seems to be timing-related:
 OCFS is working on the DRBD device when DRBD itself wants to change
 something, it seems:

 can we see your cib and your full drbd configuration please ...

 
 ...
 Dec 16 11:39:58 h06 kernel: [  122.426174] block drbd0: role( Secondary - 
 Primary )
 Dec 16 11:39:58 h06 multipathd: drbd0: update path write_protect to '0' 
 (uevent)
 Dec 16 11:40:29 h06 ocfs2_controld: start_mount: uuid 
 FD32E504527742CEA7DA6DB272D5D7B2, device /dev/drbd_r0, service ocfs2
 ...
 Dec 16 11:40:29 h06 kernel: [  152.837615] block drbd0: peer( Secondary - 
 Primary )
 Dec 16 11:40:29 h06 ocfs2_hb_ctl[19177]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -P 
 -d /dev/drbd_r0
 Dec 16 11:43:50 h06 kernel: [  354.559240] block drbd0: State change failed: 
 Device is held open by someone
 Dec 16 11:43:50 h06 kernel: [  354.559244] block drbd0:   state = { 
 cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate r- }
 Dec 16 11:43:50 h06 kernel: [  354.559246] block drbd0:  wanted = { 
 cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate r- }
 Dec 16 11:43:50 h06 drbd[28754]: [28786]: ERROR: r0: Called drbdadm -c 
 /etc/drbd.conf secondary r0
 Dec 16 11:43:50 h06 drbd[28754]: [28789]: ERROR: r0: Exit code 11
 
 A little bit later DRBD did it's own fencing (the machine rebooted)

do you have logs to confirm this?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Is there a way to let the cluster do the fencing instead of writing to
 /proc/sysrq-trigger? These handlers are used:
 handlers {
 pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh;
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
 reboot -f";
 pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh;
 /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ;
 reboot -f";
 local-io-error "/usr/lib/drbd/notify-io-error.sh;
 /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ;
 halt -f";
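 One common way to hand that decision over to the cluster is DRBD's
 Pacemaker integration; a sketch, assuming the stock handler scripts
 shipped with the drbd 8.3 packages:

   disk {
       fencing resource-and-stonith;   # resource-only if not dual-primary
   }
   handlers {
       fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
       after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
   }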
 
 Regards,
 Ulrich
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] disconnecting network of any node cause both nodes fenced

2011-12-05 Thread Andreas Kurz
On 12/05/2011 12:21 PM, Muhammad Sharfuddin wrote:
 
 On Sun, 2011-12-04 at 23:47 +0100, Andreas Kurz wrote:
 Hello,

 On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote:
 This cluster reboots (fences) both nodes if I disconnect the network of
 either node (simulating a network failure).

 Complete loss of the network is, for a cluster node, indistinguishable from
 a dead peer.


 I want resources running on a node that loses network connectivity to be
 moved/migrated to the other (still connected) node.

 Use ping RA for connectivity checks and use location constraints to move
 resources according to network connectivity (to external ping targets)

 so does having a ping RA with an appropriate location rule at least make
 sure that if one node loses network connectivity (i.e. both nodes can't
 see each other, while only one node is disconnected from the network), the
 other, healthy (network-connected) node won't reboot ... is that what you
 said?

No ... in case of service network loss of one node, resources can move
to the other node if it has a better connectivity. For this to work, the
nodes still need an extra communication path.

  

 How can I prevent this cluster from rebooting (fencing) the healthy node
 (i.e. the node whose network is up/available/connected)?

 Multiple-failure scenarios are challenging and possible solutions for a
 cluster are limited. With enough effort by an administrator every
 cluster can be tested to death.

 You can only minimize the possibility of a split-brain:
  
 * use redundant cluster communication paths (limited to two with corosync)
 in my test I disconnected every communication path of one node (and both
 nodes rebooted)

Did you clone the sbd resource? If yes, don't do that. Start it as a
primitive, so in case of a split brain at least one node needs to start
the stonith resource which should give the other node an advantage ...
adding a start-delay should further increase that advantage.
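A sketch of what that could look like, based on the sbd primitive posted
earlier in this thread (the start-delay value is arbitrary):

  primitive sbd_stonith stonith:external/sbd \
      params sbd_device="/dev/disk/by-id/scsi-360080e50002377b802ff4e4bc873" \
      op monitor interval="3000" timeout="120" \
      op start interval="0" timeout="120" start-delay="15" \
      op stop interval="0" timeout="120"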

 
 * at least one communication path is direct connected
 so a directly connected communication path plus a ping RA with a location
 rule will prevent the reboot of the healthy (network-connected) node?

No, don't use the other node as ping target ... that's ccm business ...
directly connected networks are simply less error-prone than switched
networks ... except for administrative interventions ;-)

 
 * use a quorum node

 i.e. I should add another node (a quorum node) to this two-node cluster.

Yes ... you can add a node in permanent standby mode, or starting corosync
without Pacemaker on it should also work fine.
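For example (node name is a placeholder):

  crm node standby quorum-node    # keep the third node as a permanent standby member

or install only corosync on the third node and never start Pacemaker on
it, so it contributes a membership/quorum vote but never runs resources.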

 
 If you are using a network connected fencing device use this network
 also for cluster communication.

 To prevent STONITH deathmatches, use power-off as the stonith action and/or
 don't start cluster services on system startup.

 cluster does not start at system startup

fine

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 Regards,
 Andreas

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 --
 Regards,
 
 Muhammad Sharfuddin
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] disconnecting network of any node cause both nodes fenced

2011-12-04 Thread Andreas Kurz
Hello,

On 12/04/2011 09:29 PM, Muhammad Sharfuddin wrote:
 This cluster reboots (fences) both nodes if I disconnect the network of
 either node (simulating a network failure).

Complete loss of the network is, for a cluster node, indistinguishable from
a dead peer.

 
 I want resources running on a node that loses network connectivity to be
 moved/migrated to the other (still connected) node.

Use ping RA for connectivity checks and use location constraints to move
resources according to network connectivity (to external ping targets)
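A sketch of that pattern in crm shell syntax (resource names, the ping
target and the group being moved are placeholders):

  primitive p_ping ocf:pacemaker:ping \
      params host_list="192.168.1.254" multiplier="1000" dampen="10s" \
      op monitor interval="15s" timeout="60s"
  clone cln_ping p_ping meta interleave="true"
  location loc_on_connected_node g_services \
      rule -inf: not_defined pingd or pingd lte 0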

 
 How can I prevent this cluster from rebooting (fencing) the healthy node
 (i.e. the node whose network is up/available/connected)?

Multiple-failure scenarios are challenging and possible solutions for a
cluster are limited. With enough effort by an administrator every
cluster can be tested to death.

You can only minimize the possibility of a split-brain:

* use redundant cluster communication paths (limited to two with corosync)
* at least one communication path is direct connected
* use a quorum node

If you are using a network connected fencing device use this network
also for cluster communication.

To prevent STONITH deathmatches, use power-off as the stonith action and/or
don't start cluster services on system startup.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 I am using following STONITH resource
  primitive sbd_stonith stonith:external/sbd \
   meta target-role=Started \
   op monitor interval=3000 timeout=120 \
   op start interval=0 timeout=120 \
   op stop interval=0 timeout=120 \
   params
 sbd_device=/dev/disk/by-id/scsi-360080e50002377b802ff4e4bc873
 
 
 --
 Regards,
 
 Muhammad Sharfuddin
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Light Weight Quorum Arbitration

2011-12-03 Thread Andreas Kurz
Hello Eric,

On 12/03/2011 02:36 PM, Robinson, Eric wrote:
 I have a geographically dispersed (stretch) cluster, where one node is
 in data center A and the other node is in data center B. I have done
 everything possible to ensure link redundancy between the cluster nodes.
 Each node has 4 x gigabit links connected to 4 different sets of
 switches and routers that connect the two data centers. The data centers
 are connected over two dual counter-rotating SONET rings. 
  
 
 That said, the possibility remains that the links between the two data
 centers could be severed, leading to cluster partition. Is there a way
 to provide another quorum vote or something equivalent from a third
 location out on the Internet without having a full cluster node out
 there? Florian mentioned earlier that a full cluster node would probably
 not work well because of the bandwidth and latencies involved. What I
 really want is some kind of lightweight arbiter or quorum daemon at the
 third location. I've looked around but have not seen anything like that.
 Does anyone have any ideas? I've thought of trying to roll my own using
 ssh and shell scripts. 

the concept of an arbitrator for split-site clusters is already
implemented and should be available with Pacemaker 1.1.6, though it seems
not to be directly documented yet ... besides the source code and this
draft document:

http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html

I think this is exactly what you are looking for.
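The mechanism behind that document is the booth ticket manager/arbitrator;
purely as an illustration, a minimal configuration sketch (addresses, port
and ticket name are placeholders; verify the exact syntax against the
booth version you install):

  # /etc/booth/booth.conf
  transport  = UDP
  port       = 9929
  arbitrator = 203.0.113.10    # the lightweight third-site vote
  site       = 192.0.2.10      # data center A
  site       = 198.51.100.10   # data center B
  ticket     = "ticket-db"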

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 --
 Eric Robinson
  
  
  
  
  
 
 
 Disclaimer - December 3, 2011 
 This email and any files transmitted with it are confidential and intended 
 solely for linux-ha@lists.linux-ha.org. If you are not the named addressee 
 you should not disseminate, distribute, copy or alter this email. Any views 
 or opinions presented in this email are solely those of the author and might 
 not represent those of Physicians' Managed Care or Physician Select 
 Management. Warning: Although Physicians' Managed Care or Physician Select 
 Management has taken reasonable precautions to ensure no viruses are present 
 in this email, the company cannot accept responsibility for any loss or 
 damage arising from the use of this email or attachments. 
 This disclaimer was added by Policy Patrol: http://www.policypatrol.com/
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Custom resource agent script assistance

2011-12-01 Thread Andreas Kurz
Hello Chris,

On 12/01/2011 06:25 PM, Chris Bowlby wrote:
 Hi Everyone, 
 
 I'm in the process of configuring a 2 node + DRBD enabled DHCP cluster
 using the following packages:
 
 SLES 11 SP1, with Pacemaker 1.1.6, corosync 1.4.2, and drbd 8.3.12.
 
 I know about DHCP's internal fail-over abilities, but after testing, it
 simply failed to remain viable as a more robust HA type cluster. As such
 I began working on this solution. For reference my current configuration
 looks like this:
 
 node dhcp-vm01 \
 attributes standby=off
 node dhcp-vm02 \
 attributes standby=on
 primitive DHCPFS ocf:heartbeat:Filesystem \
 params device=/dev/drbd1 directory=/var/lib/dhcp
 fstype=ext4 \
 meta target-role=Started
 primitive dhcp-cluster ocf:heartbeat:IPaddr2 \
 params ip=xxx.xxx.xxx.xxx cidr_netmask=32 \
 op monitor interval=10s
 primitive dhcpd_service ocf:heartbeat:dhcpd \
 params dhcpd_config=/etc/dhcpd.conf \
   dhcpd_interface=eth0 \
 op monitor interval=1min \
 meta target-role=Started
 primitive dhcpdrbd ocf:linbit:drbd \
 params drbd_resource=dhcpdata \
 op monitor interval=60s
 ms DHCPData dhcpdrbd \
 meta master-max=1 master-node-max=1 clone-max=2
 clone-node-max=1 notify=true
 colocation dhcpd_service-with_cluster_ip inf: dhcpd_service dhcp-cluster
 colocation fs_on_drbd inf: DHCPFS DHCPData:Master
 order DHCP-after-dhcpfs inf: DHCPFS:promote dhcpd_service:start
 order dhcpfs_after_dhcpdata inf: DHCPData:promote DHCPFS:start

DHCPFS:promote ?? ... that action will never occur, so dhcpd_service
will start whenever it likes ... typically not when it should ;-)

... remove that :promote ... and you are missing a colocation between
dhcpd_service and its file system.

I'd suggest using a group and colocate/order that with DRBD:

group g_dhcp DHCPFS dhcpd_service dhcp-cluster

.. or IP before dhcp if it needs to bind to it
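Spelled out with the resource names from the posted configuration, that
would look roughly like this (a sketch to be adapted; the individual
colocation and order constraints above would then be dropped):

  group g_dhcp DHCPFS dhcpd_service dhcp-cluster
  colocation col_dhcp_on_drbd inf: g_dhcp DHCPData:Master
  order ord_dhcp_after_drbd inf: DHCPData:promote g_dhcp:start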

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 property $id=cib-bootstrap-options \
 dc-version=1.1.5-ecb6baaf7fc091b023d6d4ba7e0fce26d32cf5c8 \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore
 rsc_defaults $id=rsc-options \
 resource-stickiness=100
 
 The floating IP works without issue, as does the DRBD integration such
 that if I put a node into standby, the IP, DRBD master/slave and FS
 mounts all transfer correctly. Only the DHCP component itself is
 failing, in that it won't start properly from within pacemaker.
 
 I suspect it is due to having to write a new script as I could not find
 an existing DHCPD RA agent anywhere. I built my own based off the
 development guide for resource agents on the wiki. I've managed to get
 it to complete all the tests I need it to pass in the ocf-tester script:
 
 ocf-tester -n dhcpd -o
 monitor_client_interface=eth0 /usr/lib/ocf/resource.d/heartbeat/dhcpd
 Beginning tests for /usr/lib/ocf/resource.d/heartbeat/dhcpd...
 * Your agent does not support the notify action (optional)
 * Your agent does not support the demote action (optional)
 * Your agent does not support the promote action (optional)
 * Your agent does not support master/slave (optional)
 /usr/lib/ocf/resource.d/heartbeat/dhcpd passed all tests
 
 Additionally if I run each of the various options
 (start/stop/monitor/validate-all/status/meta-data) at the command line,
 they all work with out issue, and stop/start the DHCPD process as
 expected.
 
 dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat # ps aux | grep dhcp
 root 12516  0.0  0.1   4344   756 pts/3S+   17:16   0:00 grep
 dhcp
 dhcp-vm01:/usr/lib/ocf/resource.d/heartbeat
 # /usr/lib/ocf/resource.d/heartbeat/dhcpd start
 DEBUG: Validating the dhcpd binary exists.
 DEBUG: Validating that we are running in chrooted mode
 DEBUG: Chrooted mode is active, testing the chrooted path exists
 DEBUG: Checking to see if the /var/lib/dhcp//etc/dhcpd.conf exists and
 is readable
 DEBUG: Validating the dhcpd user exists
 DEBUG: Validation complete, everything looks good.
 DEBUG: Testing the state of the daemon itself
 DEBUG: OCF_NOT_RUNNING: 7
 INFO: The dhcpd process is not running
 Internet Systems Consortium DHCP Server V3.1-ESV
 Copyright 2004-2010 Internet Systems Consortium.
 All rights reserved.
 For info, please visit https://www.isc.org/software/dhcp/
 WARNING: Host declarations are global.  They are not limited to the
 scope you declared them in.
 Not searching LDAP since ldap-server, ldap-port and ldap-base-dn were
 not specified in the config file
 Wrote 0 deleted host decls to leases file.
 Wrote 0 new dynamic host decls to leases file.
 Wrote 0 leases to leases file.
 Listening on LPF/eth0/00:0c:29:d7:64:99/SERVERS
 Sending on   LPF/eth0/00:0c:29:d7:64:99/SERVERS
 Sending on   Socket/fallback/fallback-net
 0
 INFO: dhcpd [chrooted] has started.
 DEBUG: Resource Agent Exit Status 0
 DEBUG: default 

Re: [Linux-HA] Q: RA reload

2011-11-30 Thread Andreas Kurz
On 11/30/2011 12:58 PM, Ulrich Windl wrote:
 Hi,
 
 when changing a purely performance-related mount option for a filesystem I
 noticed that the LRM decided to restart the resource and all dependent
 resources.
 
 As Linux supports -o remount, such a restart would not be necessary.
 
 So I wonder: when will the LRM ever decide to try a reload method (assuming
 the RA has one)?
 
 A pointer to the documentation would be OK.

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained/index.html#s-reload
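In short, the LRM only attempts a reload if the RA advertises a reload
action in its meta-data and every changed parameter is declared non-unique;
roughly, the relevant meta-data bits look like this (the parameter name is
a placeholder):

  <parameter name="options" unique="0" required="0">
    <content type="string" default=""/>
  </parameter>
  ...
  <actions>
    <action name="reload" timeout="40" />
    ...
  </actions>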

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Regards,
 Ulrich
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems






signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Pacemaker : how to modify configuration ?

2011-11-29 Thread Andreas Kurz
On 11/29/2011 10:01 AM, alain.mou...@bull.net wrote:
 Too long to explain, but in short: it is so that the maintenance team of a
 cluster can temporarily avoid fencing caused by the "Flush delayed" problem
 when resources are relocated, which randomly leads to failed monitor
 operations and therefore fencing ... even though it is not a valid error,
 it is a bug (I've opened another thread on this ML about the "Flush
 delayed" problem)

sounds like you also want to implement ACLs ... to limit maintenance
team members to modifying only specific parts of your config.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 Alain
 
 
 
 From: RaSca ra...@miamammausalinux.org
 To: General Linux-HA mailing list linux-ha@lists.linux-ha.org
 Cc: alain.mou...@bull.net
 Date: 29/11/2011 09:35
 Subject: Re: [Linux-HA] Pacemaker : how to modify configuration ?
 
 
 
 On Tue 29 Nov 2011 09:30:41 CET, alain.mou...@bull.net wrote:
 Hi
 Yes, I know it is possible this way, but I don't like telling anybody to
 use crm configure edit because it is a slightly risky command, with a risk
 of corrupting the configuration ... when I'm the person operating, I often
 use crm configure edit, but I'm a little reluctant to tell somebody else
 who is not really a pacemaker specialist to use this command.
 So I'd prefer a command built from cibadmin/grep/sed, as Andrew suggested.
 Thanks
 Alain
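 As an illustration of that cibadmin/grep/sed approach (the attribute
 being toggled and the temporary file are placeholders):

   cibadmin -Q -o resources > /tmp/resources.xml
   sed -i 's/on-fail="fence"/on-fail="ignore"/' /tmp/resources.xml
   cibadmin -R -o resources -x /tmp/resources.xml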
 
 Consider that a bad configuration will not be accepted by the crm
 editor. In addition it is possible to dump the current configuration
 before doing any modifications.
 That said... if you're reluctant to let non-specialist users modify
 the configuration, then why let them modify delicate parameters such as
 on-fail?
 




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] is it good to create order constraint for sbd resource

2011-11-29 Thread Andreas Kurz
On 11/29/2011 09:16 PM, Tim Serong wrote:
 On 11/29/2011 04:28 PM, Muhammad Sharfuddin wrote:
 On Mon, 2011-11-28 at 21:47 +0100, Tim Serong wrote:
 On 11/28/2011 06:54 PM, Muhammad Sharfuddin wrote:
 is it good/required to create order constraint for sbd resource

 I am using following fencing resource:

 primitive sbd_stonith stonith:external/sbd \
meta target-role=Started \
op monitor interval=3000 timeout=120 \
op start interval=0 timeout=120 \
op stop interval=0 timeout=120 \
 params
 sbd_device=/dev/disk/by-id/scsi-360080e50002377b802ff4e4bc873

 I have following order constraints:

 order resA-before-resB inf: resA resB symmetrical=true
 order resB-before-resC inf: resB resC symmetrical=true

 should I also create another constraint for sbd like:

 order sbd_stonith-before-resA inf: sbd_stonith resA symmetrical=true

 please help/suggest.

 No.  The STONITH resource doesn't need to be running in order for your
 other resources to be operable (hence no need for an order constraint).

 Regards,

 Tim
 true, but if there is an order constraint for the STONITH resource, it
 will at least make sure that no other resource is started before the
 STONITH resource.

 e.g:
 order sbd_stonith-before-resA inf: sbd_stonith resA symmetrical=true

 order resA-before-resB inf: resA resB symmetrical=true
 order resB-before-resC inf: resB resC symmetrical=true

 Because I stopped all the resources including the STONITH resource (and
 stopping any resource sets 'target-role=Stopped'), then started all
 resources except the STONITH resource, so at that time my cluster had no
 fencing resource available.
 
 So, don't stop the STONITH resource :)

And if you do it you are on your own ... you want it? you get it! :-)
There are a lot of ways for an administrator to lower service downtime ...

 
 Side point: if you use crm configure property stop-all-resources=true, 
 this will stop all resources *except* for any STONITH resources.  The 
 point being, you do always want them running...
 
 So in order to protect the cluster I thought that there should (must) be
 an order constraint specifying that no other resources will be started
 if the STONITH resource is stopped/unavailable.

 Please suggest/recommend
 
 You should generally be OK without order constraints on STONITH 
 resources.  I don't recall seeing any other systems where people had 
 created these constraints.  I should also note that if, say, your 
 STONITH resource is running on node-0 and that node dies, the cluster 
 will start the STONITH resource on node-1, to kill node-0.  It's smart 
 enough.
 
 Worst case, if your STONITH resource is completely broken, and a node 
 fails and thus can't be killed, the cluster will sit there and log 
 errors to syslog about its inability to kill the misbehaving node.
 
 (Question for everyone else: did I miss anything?)

IIRC stonith resources are always started first and stopped last anyways
... without extra constraints ... implicitly. Please someone correct me
if I'm wrong.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Regards,
 
 Tim





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Antw: Re: Stonith SBD not fencing nodes

2011-11-28 Thread Andreas Kurz
On 11/28/2011 08:07 PM, Hal Martin wrote:
 Thank you for the updated link.
 
 I have recompiled pacemaker from checkout b9889764 and stonith still
 fails to shoot nodes.

Maybe also posting the logs from sdgxen-3 would help.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 sdgxen-2:/ # crm node fence sdgxen-3
 Do you really want to shoot sdgxen-3? y
 
 Syslog from sdgxen-2:
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node
 sdgxen-3 will be fenced because termination was requested
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN:
 determine_online_status: Node sdgxen-3 is unclean
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: stage6: Scheduling Node
 sdgxen-3 for STONITH
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: LogActions: Leave
 stonith-sbd(Started sdgxen-2)
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_state_transition: State
 transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
 cause=C_IPC_MESSAGE origin=handle_response ]
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: unpack_graph: Unpacked
 transition 4: 4 actions in 4 synapses
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_te_invoke: Processing
 graph 4 (ref=pe_calc-dc-1322492480-29) derived from
 /var/lib/pengine/pe-warn-1278.bz2
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_pseudo_action: Pseudo
 action 5 fired and confirmed
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_fence_node: Executing
 reboot fencing operation (8) on sdgxen-3 (timeout=6)
 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info:
 initiate_remote_stonith_op: Initiating remote operation reboot for
 sdgxen-3: 76727be7-eecb-4778-857c-1a9288c63ee6
 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info:
 can_fence_host_with_device: stonith-sbd can not fence sdgxen-3:
 dynamic-list
 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: stonith_command:
 Processed st_query from sdgxen-2: rc=0
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: process_pe_message:
 Transition 4: WARNINGs found during PE processing. PEngine Input
 stored in: /var/lib/pengine/pe-warn-1278.bz2
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: process_pe_message:
 Configuration WARNINGs found during PE processing.  Please run
 crm_verify -L to identify issues.
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_query_timeout: Query 76727be7-eecb-4778-857c-1a9288c63ee6 for sdgxen-3 timed out
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_timeout: Action reboot (76727be7-eecb-4778-857c-1a9288c63ee6) for sdgxen-3 timed out
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: remote_op_done: Notifing clients of 76727be7-eecb-4778-857c-1a9288c63ee6 (reboot of sdgxen-3 from ee8c34db-0e5d-4227-aa46-0ad8b3f306d1 by (null)): 0, rc=-8
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: stonith_notify_client: Sending st_fence-notification to client 457/67849bf4-1881-48b9-a5e8-ab1f72116a81
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: StonithOp remote-op state=0 st_target=sdgxen-3 st_op=reboot /
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: Stonith operation 2/8:4:0:bd203590-3295-4f31-a720-01760a5394e8: Operation timed out (-8)
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_callback: Stonith of sdgxen-3 failed (-8)... aborting transition.
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: abort_transition_graph: tengine_stonith_callback:454 - Triggered transition abort (complete=0) : Stonith failed
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: update_abort_priority: Abort priority upgraded from 0 to 100
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: update_abort_priority: Abort action done superceeded by restart
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_notify: Peer sdgxen-3 could not be terminated (reboot) by anyone for sdgxen-2 (ref=76727be7-eecb-4778-857c-1a9288c63ee6): Operation timed out
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: run_graph:
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: notice: run_graph: Transition 4 (Complete=2, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pengine/pe-warn-1278.bz2): Stopped
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: te_graph_trigger: Transition 4 is now complete
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_state_transition: State transition S_TRANSITION_ENGINE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_pe_invoke: Query 81: Requesting the current CIB: S_POLICY_ENGINE
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: do_pe_invoke_callback: Invoking the PE: query=81, ref=pe_calc-dc-1322492486-30, seq=240, quorate=1
 Nov 28 15:01:26 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node sdgxen-3 will be fenced because termination was requested
 Nov 28 15:01:26
 

Re: [Linux-HA] Antw: Re: Stonith SBD not fencing nodes

2011-11-28 Thread Andreas Kurz
On 11/29/2011 12:14 AM, Hal Martin wrote:
 Sorry; they were included in the previous email but it appears it was
 not properly spaced to be noticeable in the wall of text.

Indeed ... already there, sorry for the noise.

strange ... where does this timeout come from? I don't see any evidence
that this fencing request ran for 60 seconds ...

Did you try to provoke a fencing action without using crm shell?
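For reference, two common ways to trigger fencing outside the crm shell
(the sbd device path is a placeholder):

  stonith_admin --reboot sdgxen-3                                   # ask stonith-ng directly
  sbd -d /dev/disk/by-id/<your-sbd-device> message sdgxen-3 reset   # write directly to the sbd slot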

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Syslog from sdgxen-3:
 Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_ais_dispatch:
 Update relayed from sdgxen-2
 Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_trigger_update:
 Sending flush op to all hosts for: terminate (true)
 Nov 28 15:01:20 sdgxen-3 attrd: [455]: notice: attrd_perform_update:
 Sent update 7: terminate=true
 Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node
 sdgxen-2 now has id: 2065306796
 Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: crm_new_peer: Node
 2065306796 is now known as sdgxen-2
 Nov 28 15:01:20 sdgxen-3 stonith-ng: [452]: info: stonith_command:
 Processed st_query from sdgxen-2: rc=0
 Nov 28 15:01:21 sdgxen-3 sbd: [442]: info: Latency: 1
 Nov 28 15:01:22 sdgxen-3 sbd: [442]: info: Latency: 1
 Nov 28 15:01:23 sdgxen-3 sbd: [442]: info: Latency: 1
 Nov 28 15:01:24 sdgxen-3 sbd: [442]: info: Latency: 1
 Nov 28 15:01:25 sdgxen-3 sbd: [442]: info: Latency: 1
 Nov 28 15:01:26 sdgxen-3 sbd: [442]: info: Latency: 1
 Nov 28 15:01:14 sdgxen-3 stonith-ng: [452]: info: stonith_command:
 Processed st_query from sdgxen-2: rc=0
 
 Thanks,
 -Hal
 
 On Mon, Nov 28, 2011 at 6:10 PM, Andreas Kurz andr...@hastexo.com wrote:
 On 11/28/2011 08:07 PM, Hal Martin wrote:
 Thank you for the updated link.

 I have recompiled pacemaker from checkout b9889764 and stonith still
 fails to shoot nodes.

 Maybe also posting the logs from sdgxen-3 would help.

 Regards,
 Andreas

 --
 Need help with Pacemaker?
 http://www.hastexo.com/now


 sdgxen-2:/ # crm node fence sdgxen-3
 Do you really want to shoot sdgxen-3? y

 Syslog from sdgxen-2:
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: pe_fence_node: Node
 sdgxen-3 will be fenced because termination was requested
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN:
 determine_online_status: Node sdgxen-3 is unclean
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: stage6: Scheduling Node
 sdgxen-3 for STONITH
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: LogActions: Leave
 stonith-sbd(Started sdgxen-2)
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_state_transition: State
 transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
 cause=C_IPC_MESSAGE origin=handle_response ]
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: unpack_graph: Unpacked
 transition 4: 4 actions in 4 synapses
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: do_te_invoke: Processing
 graph 4 (ref=pe_calc-dc-1322492480-29) derived from
 /var/lib/pengine/pe-warn-1278.bz2
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_pseudo_action: Pseudo
 action 5 fired and confirmed
 Nov 28 15:01:20 sdgxen-2 crmd: [457]: info: te_fence_node: Executing
 reboot fencing operation (8) on sdgxen-3 (timeout=6)
 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info:
 initiate_remote_stonith_op: Initiating remote operation reboot for
 sdgxen-3: 76727be7-eecb-4778-857c-1a9288c63ee6
 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info:
 can_fence_host_with_device: stonith-sbd can not fence sdgxen-3:
 dynamic-list
 Nov 28 15:01:20 sdgxen-2 stonith-ng: [452]: info: stonith_command:
 Processed st_query from sdgxen-2: rc=0
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: WARN: process_pe_message:
 Transition 4: WARNINGs found during PE processing. PEngine Input
 stored in: /var/lib/pengine/pe-warn-1278.bz2
 Nov 28 15:01:20 sdgxen-2 pengine: [456]: notice: process_pe_message:
 Configuration WARNINGs found during PE processing.  Please run
 crm_verify -L to identify issues.
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_query_timeout: Query 76727be7-eecb-4778-857c-1a9288c63ee6 for sdgxen-3 timed out
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: ERROR: remote_op_timeout: Action reboot (76727be7-eecb-4778-857c-1a9288c63ee6) for sdgxen-3 timed out
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: remote_op_done: Notifing clients of 76727be7-eecb-4778-857c-1a9288c63ee6 (reboot of sdgxen-3 from ee8c34db-0e5d-4227-aa46-0ad8b3f306d1 by (null)): 0, rc=-8
 Nov 28 15:01:26 sdgxen-2 stonith-ng: [452]: info: stonith_notify_client: Sending st_fence-notification to client 457/67849bf4-1881-48b9-a5e8-ab1f72116a81
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: StonithOp remote-op state=0 st_target=sdgxen-3 st_op=reboot /
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: info: tengine_stonith_callback: Stonith operation 2/8:4:0:bd203590-3295-4f31-a720-01760a5394e8: Operation timed out (-8)
 Nov 28 15:01:26 sdgxen-2 crmd: [457]: ERROR: tengine_stonith_callback: Stonith of sdgxen-3 failed (-8)... aborting

Re: [Linux-HA] Problem when installing Cluster Glue 1.0.8

2011-11-26 Thread Andreas Kurz
On 11/26/2011 11:39 PM, Lazaro Rubén García Martinez wrote:
 Hello everyone on this list, I am a new member. I am writing because I
 need to install heartbeat, but I am new to this software.
 
 I am trying to install Cluster Glue on a machine with CentOS 6, but when I
 execute the make command this error is shown:

is there a special reason for you not to use Pacemaker and the rest of the
cluster stack that is already shipped with CentOS 6?

 
 gmake[2]: Entering directory 
 `/usr/local/src/heartbeat/Reusable-Cluster-Components-glue--7583026c6ace/doc'
 /usr/bin/xsltproc \
 --xinclude \
 
 http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl 
 hb_report.xml
 error : Operation in progress
 warning: failed to load external entity 
 http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl;
 cannot parse 
 http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl
 gmake[2]: *** [hb_report.8] Error 4
 gmake[2]: Leaving directory 
 `/usr/local/src/heartbeat/Reusable-Cluster-Components-glue--7583026c6ace/doc'
 gmake[1]: *** [all-recursive] Error 1
 gmake[1]: Leaving directory 
 `/usr/local/src/heartbeat/Reusable-Cluster-Components-glue--7583026c6ace/doc'
 make: *** [all-recursive] Error 1

are all the build requirements installed? ... libxslt, docbook-dtds,
docbook-style-xsl, help2man, just to name a few
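On CentOS 6 that would be something like the following; without local
DocBook catalogs (or network access) xsltproc tries to fetch the stylesheet
from docbook.sourceforge.net, which is exactly the error shown above:

  yum install libxslt docbook-dtds docbook-style-xsl help2man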

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Can anybody  tell me how to solve this problem?
 
 Thank you very much for your time
 Regards, and sorry for my bad English.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] The master server of HA system suddently roll over

2011-11-22 Thread Andreas Kurz
On 11/22/2011 01:27 AM, tyo...@globalchoice.us wrote:
 Dear sirs:
 
 This is Yang. I set up 2 database servers using heartbeat
 (heartbeat-2.1.3-3.el5.centos.rpm). They have been running for over 40 days

Please consider an update to Corosync & Pacemaker ...

 very well. But suddenly the master database server rolled over to the slave
 database server. When I checked the master server, everything was good,
 and I copied the ha-log as follows:
 
 heartbeat[16316]: 2011/11/21_10:46:40 info: killing
 /usr/local/bin/check_service process group 18673 with signal 15
 heartbeat[16316]: 2011/11/21_10:46:40 info: killing
 /usr/lib/heartbeat/mgmtd -v process group 16383 with signal 15
 mgmtd[16383]: 2011/11/21_10:46:40 info: mgmtd is shutting down
 heartbeat[16316]: 2011/11/21_10:46:40 info: killing
 /usr/lib/heartbeat/crmd process group 16382 with signal 15
 crmd[16382]: 2011/11/21_10:46:40 info: crm_shutdown: Requesting shutdown
 crmd[16382]: 2011/11/21_10:46:40 info: do_state_transition: State
 transition S_IDLE - S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN
 origin=crm_shutdown ]

This looks like someone simply shut down Heartbeat ... if that is the
reason you should find a clear indication in the logs.
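A quick way to check, as a sketch (log paths may differ on your setup):

  grep -iE 'shutdown|signoff|killing' /var/log/ha-log | grep '2011/11/21_10:4'
  last -x | head    # any reboot, shutdown or runlevel change around that time?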

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 crmd[16382]: 2011/11/21_10:46:40 info: do_state_transition: All 2 cluster
 nodes are eligible to run resources.
 crmd[16382]: 2011/11/21_10:46:40 info: do_shutdown_req: Sending shutdown
 request to DC: kanridb-master
 crmd[16382]: 2011/11/21_10:46:40 info: do_shutdown_req: Processing
 shutdown locally
 crmd[16382]: 2011/11/21_10:46:40 info: handle_shutdown_request: Creating
 shutdown request for kanridb-master
 tengine[16585]: 2011/11/21_10:46:40 info: extract_event: Aborting on
 shutdown attribute for 9134a5e8-6a99-4392-ae5b-06e6a05dd9c0
 tengine[16585]: 2011/11/21_10:46:40 info: update_abort_priority: Abort
 priority upgraded to 100
 pengine[16586]: 2011/11/21_10:46:40 info: determine_online_status: Node
 kanridb-master is shutting down
 pengine[16586]: 2011/11/21_10:46:40 info: determine_online_status: Node
 kanridb-slave is online
 pengine[16586]: 2011/11/21_10:46:40 notice: group_print: Resource Group:
 group_1
 .
 
 
Can someone tell me why this master server had to roll over, and what
 went wrong on this server at 2011/11/21_10:46:40?
 
 
 
 Thanks
 
 
 Yang
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] [Heartbeat][Pacemaker] VIP doesn't switch to other server

2011-11-18 Thread Andreas Kurz
Hello Mathieu,

On 11/17/2011 07:22 PM, SEILLIER Mathieu wrote:
 Hi all,
 
 I have to use Heartbeat with Pacemaker for High Availability between 2 Tomcat 
 5.5 servers under Linux RedHat 5.4.
 The first server is active, the other one is passive. The master is called 
 servappli01, with IP address 186.20.100.81, the slave is called servappli02, 
 with IP address 186.20.100.82.
 I configured a virtual IP, 186.20.100.83. Tomcat is not launched at boot on
 either server; it is Heartbeat that starts Tomcat once it is running.
 All seems to be OK, each server sees the other as online, and the crm_mon
 command shows the output below:
 
 
 Last updated: Thu Nov 17 19:03:34 2011
 Stack: Heartbeat
 Current DC: servappli01 (bf8e9a46-8691-4838-82d9-942a13aeedca) - partition 
 with quorum
 Version: 1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 
 
 Online: [ servappli01 servappli02 ]
 
  Clone Set: ClusterIPClone (unique)
  ClusterIP:0(ocf::heartbeat:IPaddr2):   Started servappli01
  ClusterIP:1(ocf::heartbeat:IPaddr2):   Started servappli02

You did not configure a simple VIP but a cluster IP, which acts like a
simple static load balancer ... see man iptables and search for CLUSTERIP.

If this was not your intention, simply don't clone it.

If you want a clusterip you have to choose correct meta attributes:

clone ClusterIPClone ClusterIP \
meta globally-unique=true clone-node-max=2 interleave=true

  Clone Set: TomcatClone (unique)
  Tomcat:0   (ocf::heartbeat:tomcat):Started servappli01
  Tomcat:1   (ocf::heartbeat:tomcat):Started servappli02
 
 
 The 2 Tomcat servers are identical, and the same webapps are deployed on each
 server in order to be able to access the webapps on the other server if one is
 down.
 By default, requests from clients are processed by the first server because 
 it's the master.
 My problem is that when I crash the Tomcat on the first server, requests from 
 clients are not redirected to the second server. For a while, requests are 
 not processed, then Heartbeat restarts Tomcat itself and requests are 
 processed again by the first server.
 Requests are never forwarded to the second Tomcat if the first is down.

The default behavior on monitoring errors is a local restart. If you always
test from the same IP I would expect your requests to fail while Tomcat
is not running on the node you are redirected to ... so if you choose
the clusterip_hash sourceip-sourceport your chance should be 50/50 to
get redirected ... if you want a real load balancer you might want to
integrate a service like ldirectord with real-server checks to remove a
non-working service from the load balancing.

... use ip addr show or define a label to see your VIP ...
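For example (the label is arbitrary, cidr_netmask is assumed):

  primitive ClusterIP ocf:heartbeat:IPaddr2 \
      params ip="186.20.100.83" cidr_netmask="24" iflabel="vip" \
      op monitor interval="10s"

  # afterwards:  ip addr show eth0   lists the VIP as a secondary address labelled eth0:vip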

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Here is my configuration :
 
 ha.cf file (the same on each server) :
 
 crm respawn
 logfacility local0
 logfile /var/log/ha-log
 debugfile   /var/log/ha-debug
 warntime10
 deadtime20
 initdead120
 keepalive   2
 autojoinnone
 nodeservappli01
 nodeservappli02
 ucast   eth0 186.20.100.81 # ignored by node1 (owner of ip)
 ucast   eth0 186.20.100.82 # ignored by node2 (owner of ip)
 
 cib.xml file (the same on each server) :
 
 ?xml version=1.0 ?
 cib admin_epoch=0 crm_feature_set=3.0.1 
 dc-uuid=bf8e9a46-8691-4838-82d9-942a13aeedca epoch=127 have-quorum=1 
 num_updates=51 validate-with=pacemaker-1.0
   configuration
 crm_config
   cluster_property_set id=cib-bootstrap-options
 nvpair id=cib-bootstrap-options-dc-version name=dc-version 
 value=1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87/
 nvpair id=cib-bootstrap-options-cluster-infrastructure 
 name=cluster-infrastructure value=Heartbeat/
 nvpair id=cib-bootstrap-options-expected-quorum-votes 
 name=expected-quorum-votes value=2/
 nvpair id=cib-bootstrap-options-no-quorum-policy 
 name=no-quorum-policy value=ignore/
 nvpair id=cib-bootstrap-options-stonith-enabled 
 name=stonith-enabled value=false/
   /cluster_property_set
 /crm_config
 nodes
   node id=489a0305-862a-4280-bce5-6defa329df3f type=normal 
 uname=servappli01/
   node id=bf8e9a46-8691-4838-82d9-942a13aeedca type=normal 
 uname=servappli02/
 /nodes
 resources
   clone id=TomcatClone
 meta_attributes id=TomcatClone-meta_attributes
   nvpair id=TomcatClone-meta_attributes-globally-unique 
 name=globally-unique value=true/
 /meta_attributes
 primitive class=ocf id=Tomcat provider=heartbeat type=tomcat
   instance_attributes id=Tomcat-instance_attributes
 nvpair id=Tomcat-instance_attributes-tomcat_name 
 name=tomcat_name value=TomcatSBNG/
 nvpair 

Re: [Linux-HA] Antw: Re: setting one resource of a group to unmanaged: undesired side-effects

2011-11-10 Thread Andreas Kurz
On 11/10/2011 11:48 AM, Ulrich Windl wrote:
 Andreas Kurz andr...@hastexo.com schrieb am 09.11.2011 um 15:48 in 
 Nachricht
 4eba92ad.6050...@hastexo.com:
 On 11/09/2011 01:22 PM, Ulrich Windl wrote:
 Hi!

 I have a question : If I have a resource group with pacemaker 
 (pacemaker-1.1.5-5.9.11.1 on SLES11 SP1) that has several resources, and I 
 set one resource to unmanaged, the group should not be affected, right?
 I also have a colocation like
 colocation col_ip2 inf: prm_ip2 cln_foo:Slave

 When I set prm_ip2 to unmanaged the clone (ms) resource cln_foo did a 
 promote (slave to master on node where prm_ip2 is, and a demote master to 
 slave on another node) action.
 The idea of the colocation was to have the IP address on the node where the 
 slave instance of cln_foo is running. I hope the setup was fine.

 Misconfiguration or software bug?

 Try using the whole group and not only the ip in the colocation
 constraint ... if the cluster sees no need to bind a resource to a
 specific node it feels free to relocate resources if scores ... or in
 this case promotion scores are the same.

 Though I am not sure if this behavior is intended for unmanaged
 resources in the case you described ... might be a feature ;-)

 Generally I suggest using maintenance mode so you can be sure the cluster
 really does nothing with any resource while you are doing your admin tasks.
 
 Good point: Can I activate maintenance mode with the crm shell? AFAIK,
 maintenance mode as done by the GUI just puts all resources into
 unmanaged mode. Did I miss something?
 

crm configure property maintenance-mode=true

... really puts all resources into unmanaged mode, regardless of their
individual is-managed settings.
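When the admin work is done, the same property hands control back (a sketch of the usual follow-up):

    crm configure property maintenance-mode=false   # resources are managed again
    crm_mon -1                                      # one-shot status check that nothing is left unmanaged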

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 Regards,
 Ulrich
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

-- 
Need help with Pacemaker?
http://www.hastexo.com/now





Re: [Linux-HA] Antw: Re: Q: colocations for clone resources, transitivity

2011-11-10 Thread Andreas Kurz
On 11/09/2011 02:19 PM, Ulrich Windl wrote:
 Florian Haas flor...@hastexo.com schrieb am 09.11.2011 um 14:01 in 
 Nachricht
 4eba7997.9060...@hastexo.com:
 On 2011-11-09 13:36, Ulrich Windl wrote:
 Hi!

 I tried to co-locate a ocfs clone with a drbd ms-clone, and I tried to 
 co-locate a CTDB clone with the ocfs clone also. The idea was that the OCFS 
 filesystem is where the DRBD is, and the CTDB is where the filesystem is. So 
 actually that is a transitive colocation like: CTDB -> OCFS -> DRBD

 I guess CRM can't handle that even if CTDB is to be started before OCFS, 
 and OCFS before CTDB. Syslog has messages like: 
 notice: clone_rsc_colocation_rh: Cannot pair prm_DLM:0 with instance of 
 msc_drbd_r0
 notice: clone_rsc_colocation_rh: Cannot pair prm_ctdb:0 with instance of 
 cln_ocfs
 notice: clone_rsc_colocation_rh: Cannot pair prm_ctdb:1 with instance of 
 cln_ocfs

 <rsc_colocation id="col_ocfs_on_drbd_r0" rsc="cln_ocfs" score="INFINITY"
  with-rsc="msc_drbd_r0"/>
 <rsc_colocation id="col_ctdb_ocfs" rsc="cln_ctdb" score="INFINITY"
  with-rsc="cln_ocfs"/>

 It would be easier to maintain if resources would not require a multi-level 
 colocation (like depending CTDB on DRBD and depending OCFS on DRBD).

 What is the easiest solution?

 Group them all, then clone the group.
 
 That would not work in the general case, because you only need the ocfs
 framework once, the drbd framework less often, and the filesystem more often.
 Once you have two DRBD devices or two OCFS filesystems that won't work
 well.
 

Use resource sets when groups are too inflexible.
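A sketch of what that could look like for the resources above (crm shell syntax; the exact semantics of resource sets differ between Pacemaker versions, so double-check the result with ptest or crm_simulate):

    order ord_stack inf: msc_drbd_r0:promote cln_ocfs:start cln_ctdb:start
    colocation col_stack inf: cln_ctdb cln_ocfs msc_drbd_r0:Master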

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 Ulrich
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems






Re: [Linux-HA] setting one resource of a group to unmanaged: undesired side-effects

2011-11-09 Thread Andreas Kurz
On 11/09/2011 01:22 PM, Ulrich Windl wrote:
 Hi!
 
 I have a question : If I have a resource group with pacemaker 
 (pacemaker-1.1.5-5.9.11.1 on SLES11 SP1) that has several resources, and I 
 set one resource to unmanaged, the group should not be affected, right?
 I also have a colocation like
 colocation col_ip2 inf: prm_ip2 cln_foo:Slave
 
 When I set prm_ip2 to unmanaged the clone (ms) resource cln_foo did a 
 promote (slave to master on node where prm_ip2 is, and a demote master to 
 slave on another node) action.
 The idea of the colocation was to have the IP address on the node where the 
 slave instance of cln_foo is running. I hope the setup was fine.
 
 Misconfiguration or software bug?

Try using the whole group and not only the IP in the colocation
constraint ... if the cluster sees no need to bind a resource to a
specific node it feels free to relocate resources when the scores (or, in
this case, the promotion scores) are the same.
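A sketch of that, assuming prm_ip2 lives in a group (grp_ip2 is a made-up name here):

    colocation col_ip2 inf: grp_ip2 cln_foo:Slave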

Though I am not sure if this behavior is intended for unmanaged
resources in the case you described ... might be a feature ;-)

Generally I suggest using maintenance mode so you can be sure the cluster
really does nothing with any resource while you are doing your admin tasks.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Regards,
 Ulrich
 
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems







Re: [Linux-HA] md group take over?

2011-11-01 Thread Andreas Kurz
Hello,

On 11/01/2011 11:29 PM, Miles Fidelman wrote:
 I've seen a few references to a resource agent for md group take over 
 - but can't seem to find the actual agent or any documentation.

You mean the ocf:heartbeat:ManageRAID RA?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Anybody know if it's real?  Where to find more info?
 
 Thanks much,
 
 Miles Fidelman
 






Re: [Linux-HA] SANs falling over, don't know why!

2011-10-31 Thread Andreas Kurz
 netmask to CIDR as:
 24 Oct 30 06:02:51 iscsi1cl6 last message repeated 2 times Oct 30
 06:02:52 iscsi1cl6 kernel: iscsi_trgt: Abort Task (01) issued on 
 tid:1
 lun:6 by sid:4222124721766912 (Function Complete) Oct 30 06:02:52
 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) 
 Converted dotted-quad netmask to CIDR as: 24 Oct 30 06:03:17 
 iscsi1cl6 last message repeated 24 times Oct 30 06:03:18 iscsi1cl6 kernel:
 iscsi_trgt: cmnd_rx_start(1849) 1 3b30 -7 Oct 30 06:03:18
 iscsi1cl6 kernel: iscsi_trgt: cmnd_skip_pdu(459) 3b30 1 2a 4096 
 Oct 30 06:03:18 iscsi1cl6 lrmd: [3770]: info: RA output:
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as:
 24 Oct 30 06:03:49 iscsi1cl6 last message repeated 30 times Oct 30
 06:04:32 iscsi1cl6 last message repeated 42 times Oct 30 06:04:33
 iscsi1cl6 cib: [3769]: info: cib_stats: Processed 1 operations 
 (1.00us average, 0% utilization) in the last 10min Oct 30 
 06:04:33
 iscsi1cl6 lrmd: [3770]: info: RA output: (ClusterIP:monitor:stderr) 
 Converted dotted-quad netmask to CIDR as: 24 Oct 30 06:05:04 
 iscsi1cl6 last message repeated 30 times Oct 30 06:05:41 iscsi1cl6 
 last message repeated 36 times Oct 30 06:05:42 iscsi1cl6 kernel: 
 iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by 
 sid:5629499605320192 (Function
 Complete)


 Regards,

 James

 -Original Message-
 From: linux-ha-boun...@lists.linux-ha.org
 [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of James Smith
 Sent: 30 October 2011 00:25
 To: General Linux-HA mailing list
 Subject: Re: [Linux-HA] SANs falling over, don't know why!

 Hi,

 Changed nothing to my knowledge :p

 These boxes don't currently have fencing enabled.  I imagine the reboot is 
 caused by a kernel panic, sysctl is set to reboot on this.

 There is one big 4TB LUN, used by several VMs on XenServer, each with 
 multiple disks.

 In my quest to resolve, I have changed iet to use fileio instead of blockio 
 and fiddled with some drbd performance related bits 
 (http://www.drbd.org/users-guide/s-latency-tuning.html).

 If I'm woken up again tonight with this thing breaking it's going in the 
 bin.  I'll probably also ditch ietd and try open-iscsi or iscsi-scst.  
 Monday morning I'll be shifting some load off this cluster also.

 Regards,

 James

 -Original Message-
 From: linux-ha-boun...@lists.linux-ha.org
 [mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Andreas 
 Kurz
 Sent: 29 October 2011 22:36
 To: linux-ha@lists.linux-ha.org
 Subject: Re: [Linux-HA] SANs falling over, don't know why!

 Hello,

 On 10/29/2011 08:47 PM, James Smith wrote:
 Hi,

 All of a sudden, a SAN pair which was running without any problems for six 
 months, now decides to fall over every couple of hours.

 So what did you change? ;-)


 The logs I have to go on are below:

 Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times Oct 29
 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on
 tid:1
 lun:6 by sid:844424967684608 (Function Complete) Oct 29 19:09:24
 iscsi2cl6 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) 
 Converted dotted-quad netmask to CIDR as: 24 Oct 29 19:09:49
 iscsi2cl6 last message repeated 24 times Oct 29 19:09:49 iscsi2cl6 kernel:
 iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by
 sid:1125899927618048 (Function Complete) Oct 29 19:09:49 iscsi2cl6
 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by
 sid:1407374904328704 (Function Complete) Oct 29 19:09:49 iscsi2cl6
 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by
 sid:281474997486080 (Function Complete) Oct 29 19:09:50 iscsi2cl6
 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted 
 dotted-quad netmask to CIDR as: 24 Oct 29 19:09:50 iscsi2cl6 kernel:
 iscsi_trgt: Abort Task (01) issued on tid:1 lun:6 by
 sid:562949974196736 (Function Complete) Oct 29 19:09:51 iscsi2cl6
 lrmd: [4677]: info: RA output: (ClusterIP:monitor:stderr) Converted 
 dotted-quad netmask to CIDR as: 24 Oct 29 19:09:53 iscsi2cl6 last 
 message repeated 2 times Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt:
 Abort Task (01) issued on tid:1 lun:6 by sid:844424967684608 
 (Function
 Complete) Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task
 (01) issued on tid:1 lun:6 by sid:844424967684608 (Function 
 Complete) Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output:
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as:
 24 Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times Oct 29
 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on
 tid:1
 lun:6 by sid:1407374904328704 (Function Complete) Oct 29 19:10:06
 iscsi2cl6 last message repeated 4 times Oct 29 19:10:06 iscsi2cl6
 kernel: block drbd0: istiod1[4695] Concurrent local write detected!
 [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584 Oct 
 29
 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent 
 local write detected! [DISCARD L] new: 2077806184s +512

Re: [Linux-HA] SANs falling over, don't know why!

2011-10-29 Thread Andreas Kurz
Hello,

On 10/29/2011 08:47 PM, James Smith wrote:
 Hi,
 
 All of a sudden, a SAN pair which was running without any problems for six 
 months, now decides to fall over every couple of hours.

So what did you change? ;-)

 
 The logs I have to go on are below:
 
 Oct 29 19:09:23 iscsi2cl6 last message repeated 12 times
 Oct 29 19:09:23 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:844424967684608 (Function Complete)
 Oct 29 19:09:24 iscsi2cl6 lrmd: [4677]: info: RA output: 
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
 Oct 29 19:09:49 iscsi2cl6 last message repeated 24 times
 Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:1125899927618048 (Function Complete)
 Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:1407374904328704 (Function Complete)
 Oct 29 19:09:49 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:281474997486080 (Function Complete)
 Oct 29 19:09:50 iscsi2cl6 lrmd: [4677]: info: RA output: 
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
 Oct 29 19:09:50 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:562949974196736 (Function Complete)
 Oct 29 19:09:51 iscsi2cl6 lrmd: [4677]: info: RA output: 
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
 Oct 29 19:09:53 iscsi2cl6 last message repeated 2 times
 Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:844424967684608 (Function Complete)
 Oct 29 19:09:53 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:844424967684608 (Function Complete)
 Oct 29 19:09:54 iscsi2cl6 lrmd: [4677]: info: RA output: 
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
 Oct 29 19:10:05 iscsi2cl6 last message repeated 11 times
 Oct 29 19:10:06 iscsi2cl6 kernel: iscsi_trgt: Abort Task (01) issued on tid:1 
 lun:6 by sid:1407374904328704 (Function Complete)
 Oct 29 19:10:06 iscsi2cl6 last message repeated 4 times
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 2077806177s +3584; pending: 2077806177s +3584
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 2077806184s +512; pending: 2077806184s +512
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1693425337s +3584; pending: 1693425337s +3584
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1693425344s +512; pending: 1693425344s +512
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1693425321s +3584; pending: 1693425321s +3584
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1693425328s +512; pending: 1693425328s +512
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1693425313s +3584; pending: 1693425313s +3584
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1693425320s +512; pending: 1693425320s +512
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1743088585s +3584; pending: 1743088585s +3584
 Oct 29 19:10:06 iscsi2cl6 kernel: block drbd0: istiod1[4695] Concurrent local 
 write detected! [DISCARD L] new: 1743088592s +512; pending: 1743088592s +512

Concurrent local writes ... Is there any kind of cluster software using
a shared quorum disk or something like that on this LUN? Or is this LUN
shared between several VMware ESX VMs?

 Oct 29 19:10:06 iscsi2cl6 lrmd: [4677]: info: RA output: 
 (ClusterIP:monitor:stderr) Converted dotted-quad netmask to CIDR as: 24
 
 After this event, both members of the SAN pair reboot.  It is very 
 disruptive, as it's killing the VMs using this SAN, requiring fsck's after 
 failure.  The load on the SAN doesn't need to be very high for this happen.
 

Do they reboot because of a kernel panic, or because of some fencing mechanism?

 Running the following:
 
 CentOS 5 with kernel 2.6.18-274.7.1.el5
 IET 1.4.20.2
 Pacemaker 1.0.11-1.2.el5
 DRBD 8.3.11

It would be interesting to see the Pacemaker/DRBD/IET configs ...

Regards,
Andreas
-- 
Need help with Pacemaker?
http://www.hastexo.com/now





Re: [Linux-HA] cib.xml missing on a cluster node

2011-10-28 Thread Andreas Kurz
On 10/28/2011 09:21 AM, Alessandra Giovanardi wrote:
 Hi,
 I solved the problem by using the hb_gui on the other node of the cluster.
 
 In that case the IPaddr resource shows the correct ocf/heartbeat 
 CLASS/PROVIDER (on the other node only heartbeat was reported) and the 
 new RG with the new IP are starting correctly.
 
 In effect seems a bug in the hb_gui...someone had the same problem on 
 DEBIAN 2.1.3?

this software is nearly four years old! ... do I have to say more? ;-)

 In your opinion can be a problem affecting the cluster functionality or 
 only the management?

I preferred manipulating an offline version of the CIB back when HB 2.1.3
was current: validate the edited copy and only replace the running config
if all was ok.

My opinion is: consider an update ... or hire someone that can assist you.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Thank you
 Bye
 Alessandra
 
 
 On 10/28/2011 12:32 AM, Alessandra Giovanardi wrote:
 On 10/28/2011 12:12 AM, Andreas Kurz wrote:
 Hello,

 On 10/27/2011 11:54 PM, Alessandra Giovanardi wrote:
 Hi,
 I followed your suggestion and all was ok thanks...both nodes update
 correclty your configuration and takeover works fine.

 Anyway, after those operations, we tried to add another RG to our
 cluster, named group_univdrupal_prod with only one IP and we had some
 problem.
 The RG is correctly created but at the start (we disabled stonith first)
 an error occurs. I paste you the log at the end.

 Why the resource_univdrupal_IP_1 seems unmanaged?
 For any reason it was unable to start and then to stop the IP on
 gicdrupal01 ...

 Quite strange since if I start this IP on gicdrupal01 with:
 ifconfig eth1:0 130.186.99.43 netmask 255.255.255.0 up

 the IP is correctly configured on the interface...
 So it seems a heartbeat problem

 The only strange think I observe is that the IP resource still present
 into the first RG was heartbeat::ocf:IPaddr, while the new is created as:
 heartbeat:IPaddr
 These are different resource-agents ... use heartbeat:ocf:IPaddr2 ...
 yes, the one with the 2 at the end ;-)

 I tried also IPaddr2 without success (also in that case the resource is
 created without the :ocf: field--  why?, from hb_gui the only choices
 are IPaddr and IPaddr2 with hearbeat and not ocf/heartbeat as
 Class/Provider:

pengine[27190]: 2011/10/27_22:35:53 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped
 pengine[27190]: 2011/10/27_22:35:53 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped
 pengine[27190]: 2011/10/27_22:35:54 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped
 pengine[27190]: 2011/10/27_22:36:05 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped
 pengine[27190]: 2011/10/27_22:36:05 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01
 FAILED
 pengine[27190]: 2011/10/27_22:36:06 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01
 (unmanaged) FAILED
 pengine[27190]: 2011/10/27_22:36:14 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01
 (unmanaged) FAILED
 pengine[27190]: 2011/10/27_22:36:18 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Stopped
 pengine[27190]: 2011/10/27_22:36:21 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01
 FAILED
 pengine[27190]: 2011/10/27_22:36:22 notice: native_print:
 resource_univdrupal_IP_1(heartbeat:IPaddr2):Started gicdrupal01
 (unmanaged) FAILE

 It sound me like a problem of the hb_gui, which does not compromise the
 cluster features, but does not permit the RG creation...
 Some other suggestions to by-pass the problem?
 even if I select from the hb_GUI the IPaddr heartbeat...
 Furthermore the heartbeat release version seems the last available into
 the DEBIAN lenny stable repository:

 ii  heartbeat
 2.1.3-6lenny4  Subsystem for High-Availability Linux
 ii  heartbeat-2
 2.1.3-6lenny4  Subsystem for High-Availability Linux
 ii  heartbeat-2-gui
 2.1.3-6lenny4  Provides a gui interface to manage heartbeat
 clusters
 ii  heartbeat-gui  2.1.3-6lenny4
 really, really, really consider an upgrade ... even if you are on lenny,
 use latest pacemaker backports packages.

 Regards,
 Andreas

 Is quite complicated for us ;-)

 Thanks
 A.

 
 






Re: [Linux-HA] cib.xml missing on a cluster node

2011-10-28 Thread Andreas Kurz
On 10/28/2011 11:06 AM, Alessandra Giovanardi wrote:
 On 10/28/2011 10:35 AM, Andreas Kurz wrote:
 On 10/28/2011 09:21 AM, Alessandra Giovanardi wrote:
 Hi,
 I solved the problem by using the hb_gui on the other node of the cluster.

 In that case the IPaddr resource shows the correct ocf/heartbeat
 CLASS/PROVIDER (on the other node only heartbeat was reported) and the
 new RG with the new IP are starting correctly.

 In effect seems a bug in the hb_gui...someone had the same problem on
 DEBIAN 2.1.3?
 this software is nearly four years old! ... do I have to say more? ;-)

 right, but this cluster should be active only for the next three/four 
 months and then we migrate all services on a farm (load balanced 
 environment).
 Furthermore the services are very critical (no more than 5 minutes in 
 offline mode)...so we would minimize all change...

yes ... never touch a running system ;-)

 
 In your opinion can be a problem affecting the cluster functionality or
 only the management?
 I prefered manipulating an offline version of the cib when HB 2.1.3 was
 in, validation checked and replaced current config if all was ok.

 How could you validate the new cib (and also the current one?)?
 We need another cluster to test it?

No need for another cluster to use the tools ... of course it is a good
idea to have a test system before deploying a new config to critical
systems.

Try: cibadmin, crm_verify ... also xmllint might be helpful
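A sketch of that offline round trip (paths are examples):

    cibadmin --query > /tmp/cib.xml              # dump the running CIB
    # ... edit /tmp/cib.xml offline ...
    xmllint --noout /tmp/cib.xml                 # well-formedness check
    crm_verify --xml-file /tmp/cib.xml           # cluster-level validation
    cibadmin --replace --xml-file /tmp/cib.xml   # only if both checks pass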

 
 My opinion is: consider an update ... or hire someone that can assist you.

 Correct but as reported above this cluster should switch off in few months.

Good luck!

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 Thank you.
 A.
 






Re: [Linux-HA] Standard dlm_controld from cman

2011-10-27 Thread Andreas Kurz
hello,

On 10/27/2011 03:25 AM, Nick Khamis wrote:
 Hello Everyone,
 
 I just compiled the latest version of cman (cluster 3.1.7) for
 standard dlm_controld support.
 My setup is as such (compiled from source, in the order):
 
 Glue 1.0.8
 RA 3.9
 Corosync 1.4.2
 OpenAIS 1.1.4
 Cluster 3.1.7
 Pacemaker 1.1.6
 
 
 dlm_controld:
 
 /usr/local/src/cluster-3.1.7/group/dlm_controld
 /usr/local/src/cluster-3.1.7/group/dlm_controld/dlm_controld.h
 /usr/local/src/cluster-3.1.7/group/dlm_controld/dlm_controld
 
 But the controld RA still complains:
 
 controld[20623]: ERROR: Setup problem: couldn't find command: 
 dlm_controld.pcmk

IIRC there is no Pacemaker-specific controld in Cluster 3.1.x any more;
this version is intended to be universal ... have you tried symlinking?
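Something along these lines might already be enough (a sketch; the path depends on where your cluster-3.1.7 build installed the binary):

    ln -s /usr/sbin/dlm_controld /usr/sbin/dlm_controld.pcmk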

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Issuing crm_report --features:
 
 1.1.6 - 9971ebba4494012a93c03b40a2c58ec0eb60f50c:  ncurses
 corosync-quorum corosync
 
 Do I need to re-install pacemaker for the cman dlm stuff to be
 included? This is for ocfs2.
 
 Thanks in Advance,
 
 Nick.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





Re: [Linux-HA] Basic Question about LVM

2011-10-25 Thread Andreas Kurz
On 10/25/2011 02:06 AM, Bob Schatz wrote:
 Andreas,
 
 Thanks for your help!
 
 Comments below with [BS]
 
 
 
 From: Andreas Kurz andr...@hastexo.com
 To: linux-ha@lists.linux-ha.org
 Sent: Wednesday, October 19, 2011 3:36 AM
 Subject: Re: [Linux-HA] Basic Question about LVM
 
 Hello,
 
 On 10/18/2011 11:59 PM, Bob Schatz wrote:
 I am trying to setup a LVM fail over cluster and I must be missing something 
 basic. :(

 The configuration I want is:

  IP address
   |
  File System
   |
   LVM

 I am not using DRBD.
 
 So you are using a shared storage device like a FC disk?
 
 [BS] Yes.  We are using shared disk.
 

 I am running this on Ubuntu 10.04LS.

 Everything works fine and I can migrate the group between two nodes.   
 However, if I reboot one node OR STONITH one node it causes the other node 
 to stop and then restart all resources.

 The problem is that when the node reboots, it activates the LVM volume group 
 and then Pacemaker says native_add_running: Resource ocf::LVM:volume-lvm-p 
 appears to be active on 2 nodes.  This causes the group to stop and then be 
 restarted.

 I tried to play with /etc/lvm/lvm.conf filtering but that just prevented the 
 disks from being read even by the agent.
 
 Adopted volume_list Parameter to not activate all vgs? After doing
 changes to lvm.conf you also need to recreate initramfs.
 
 Safest and recommended setup is to use clvmd with dlm and update your vg
 to a clustered one (also adopt the locking type in lvm.conf).
 
 Last step is to integrate dlm/clvmd/lvm in your Pacemaker setup and you
 are ready to go.
 
 [BS]  Thanks.  I am studying the various docs I have found on setting up 
 clvmd and it looks pretty straightforward.  
   One question I have is do I need to configure the o2cb resource?  I 
 do not want a shared file system and therefore I am
   not going to use OCFS2.   It seemed to me that o2cb only makes sense if you 
 have OCFS2.

No need for o2cb if you are not using OCFS2.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 
 Thanks again,
 
 Bob
 
 
 Regards,
 Andreas
 






Re: [Linux-HA] what if brain split happens

2011-10-25 Thread Andreas Kurz
On 10/25/2011 07:51 PM, Hai Tao wrote:
 
 actually what I saw is that both nodes shut down heartbeat, and then 
 restarted heartbeat

I guess you are using Heartbeat in v1 mode ... without crm, but with an
haresources config file?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now


 
 
 Thanks.
  
 Hai Tao
  
 
 
 Date: Tue, 25 Oct 2011 10:35:53 -0700
 From: david.l...@digitalinsight.com
 To: dmaz...@bmrb.wisc.edu; linux-ha@lists.linux-ha.org
 Subject: Re: [Linux-HA] what if brain split happens

 On Tue, 25 Oct 2011, Dimitri Maziuk wrote:

 On 10/24/2011 10:49 PM, Hai Tao wrote:

 In case heartbeat communication is lost, brain split then happened, both 
 nodes (a two nodes cluster for a simple example) are having the vip and 
 other resources.

 When the heartbeat commnication comes back, what will happen?

 1. both nodes will still having the vip and resources forever?
 2. both nodes realize that brain split has happened, and will restart 
 heartbeat?

 In theory -- #2 except they shouldn't restart heartbeat, one of them
 should stop the resources. In practice one of the interesting things
 that happen when the comms come back is you have a duplicate ip address
 (vip) on your network. That's not something you want to happen, so you
 better make sure one of the nodes is down before you restore the comms.

 actually, I believe that what happens is that both nodes stop the resource,
 and then one of the nodes starts it.

 this solves the dup-IP problem because starting the resource re-sends the 
 appropriate ARP packets to clean up the network.

 David Lang
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems






Re: [Linux-HA] [DRBD-user] Questions Regarding Configuration

2011-10-25 Thread Andreas Kurz
Hello Nick,

On 10/25/2011 02:43 AM, Nick Khamis wrote:
 Hello Andreas,
 
 I did not want to post the following to the list.

hmmm ... was not exactly successful ;-)

 Thank you so much
 for your help thus far, it
 has enabled us to get up and running, and focus on other aspects of
 the project. 

Glad to help!

I am slowly
 starting to learn the pcmk concept the hard way! ;) As for:
 
 Cluster file system for Asterisk? Are you sure it's worth adding that
 extra layer?
 
 The idea is to cluster asterisk providing both failover and load
 balancing. We will attempt to do this using
 an ocf resource agent implemented by hastexo, or using a proxy. For
 this reason we are liking the idea of
 using a network filesystem for the asterisk config files. It's just
 prototyped right now using virutal machines.

So you really want it the extra hard way?! Cluster fs for some config
files ... Good luck ;-)

 
 Yes of course, we can have an active/active asterisk cluster with each
 instance managing its own config files
 and therefore, eliminating ocfs/gfs etc.. Right now I have to figure
 out how ocf:pacemaker:controld works. And
 get myself away from the not installed error We have a working
 distributed locking mechanism however, I
 cannot find any good information on what the resource agent requires,
 how it works etc...

Ubuntu 10.04 LTS - Lucid - is one the few distros having packages for
all stacks (ocfs2/gfs2) see:

http://martinloschwitz.wordpress.com/2011/10/24/updated-linux-cluster-stack-packages-for-ubuntu-10-04/

 
 As you know, the good thing about learning the hard way, and making
 ALL the mistakes is that, I will be able
 to idenify, and point out the mistakes others make when seeking help
 from the mailing list.

Nice to hear you want to give something back to the community.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Kind Regards,
 
 Nick from Toronto.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems







Re: [Linux-HA] PCMK + OCFS2

2011-10-25 Thread Andreas Kurz
hello,

On 10/25/2011 08:14 PM, Nick Khamis wrote:
 Hello Everyone,
 
 Moving forward, I noticed that there was not much documentation
 regarding getting the pcmk stack working
 with ocfs2. I have the configuration up and running however, missed
 the part regarding getting what is required
 for pcmk+ocfs support (to get ocf:pacemaker:controld +
 ocf:pacemaker:o2cb working).

yes, there is not that much documentation on that part.

 
 Everything is build from source using the latest version of Glue, RA,
 PCMK, and OpenAIS. OCFS2 works fine
 manually, and now I am trying to get corosync to handle it. This is on
 a prototype environment right now using
 Debian Squeeze however, will be using an EL like Red Hat for
 production. No stonith is required just yet, but if
 the documentation includes that as well it would be beneficial very soon.

When using a cluster fs, fencing is not optional but an obligation!

If you already know you want to go with RHEL or some derivative, you can
save yourself some miles by not going down the OCFS2 path ... I can
recommend reading:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Clusters_from_Scratch/index.html

... you will see this is quite different from running OCFS2 on, let's say,
Debian.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now





Re: [Linux-HA] Resources didn't failover when failed

2011-10-23 Thread Andreas Kurz
Hello,

On 10/23/2011 11:30 AM, James Smith wrote:
 Hi,
 
 I was presented with the following status of a two node cluster:
 
 [root@iscsi1cl2 primestaff]# crm_mon -fN
 Attempting connection to the cluster...
 Last updated: Sat Oct 22 18:10:07 2011
 Stack: openais
 Current DC: iscsi1cl2 - partition with quorum
 Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
 2 Nodes configured, 2 expected votes
 2 Resources configured.
 
 
 Online: [ iscsi1cl2 iscsi2cl2 ]
 
 Master/Slave Set: iscsidrbdclone
  Masters: [ iscsi2cl2 ]
  Slaves: [ iscsi1cl2 ]
 Resource Group: coregroup
  ClusterIP  (ocf::heartbeat:IPaddr2):   Started iscsi2cl2
  iscsitarget(ocf::heartbeat:iSCSITarget):   Started iscsi2cl2 
 FAILED
  iscsilun   (ocf::heartbeat:iSCSILogicalUnit):  Started iscsi2cl2 
 (unmanaged) FAILED
  mail_me(ocf::heartbeat:MailTo):Stopped
 
 Migration summary:
 * Node iscsi2cl2:
iscsitarget: migration-threshold=3 fail-count=1
iscsilun: migration-threshold=3 fail-count=100
ClusterIP: migration-threshold=3 fail-count=1
 * Node iscsi1cl2:
 
 Failed actions:
 iscsitarget_monitor_1000 (node=iscsi2cl2, call=16, rc=-2, status=Timed 
 Out): unknown exec error
 iscsilun_stop_0 (node=iscsi2cl2, call=24, rc=-2, status=Timed Out): 
 unknown exec error

Error on stop and no fencing configured: the cluster does not know whether
the resource is still running, so it blocks.
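A sketch of what a fencing primitive could look like (external/ipmi as an example; all values are placeholders):

    primitive st-ipmi stonith:external/ipmi \
            params hostname=iscsi2cl2 ipaddr=10.0.0.5 userid=admin passwd=secret \
            op monitor interval=60 timeout=60
    property stonith-enabled=true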

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 In this instance, I am wondering why the resource didn't failover to the 
 awaiting secondary server? :(
 
 Config is below:
 
 crm(live)configure# show
 node iscsi1cl2 \
 attributes standby=off
 node iscsi2cl2 \
 attributes standby=off
 primitive ClusterIP ocf:heartbeat:IPaddr2 \
 params ip=10.100.0.101 cidr_netmask=255.255.255.0 nic=vlan158 \
 op monitor interval=1s \
 meta migration-threshold=3
 primitive iscsidrbd ocf:linbit:drbd \
 params drbd_resource=iscsidisk \
 op monitor interval=15s role=Master timeout=30s \
 op monitor interval=16s role=Slave timeout=31s \
 meta migration-threshold=3
 primitive iscsilun ocf:heartbeat:iSCSILogicalUnit \
 params implementation=iet lun=2 
 target_iqn=iqn.2010-05.iscsicl2:LUN02.sanvol path=/dev/drbd0 
 scsi_id=19101000101cl2iscsi \
 op monitor interval=1s timeout=5s \
 meta target-role=Started migration-threshold=3
 primitive iscsitarget ocf:heartbeat:iSCSITarget \
 params implementation=iet iqn=iqn.2010-05.iscsicl2:LUN02.sanvol 
 portals= \
 meta target-role=Started migration-threshold=3 \
 op monitor interval=1s timeout=5s
 primitive mail_me ocf:heartbeat:MailTo \
 params email=a...@a.com \
 op start interval=0 timeout=60s \
 op stop interval=0 timeout=60s \
 op monitor interval=10 timeout=10 depth=0
 group coregroup ClusterIP iscsitarget iscsilun mail_me
 ms iscsidrbdclone iscsidrbd \
 meta master-max=1 master-node-max=1 clone-max=2 
 clone-node-max=1 notify=true target-role=Started migration-threshold=3
 colocation core_group-with-iscsidrbdclone inf: coregroup iscsidrbdclone:Master
 order iscsidrbdclone-before-core_group inf: iscsidrbdclone:promote 
 iscsitarget:start
 property $id=cib-bootstrap-options \
 dc-version=1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3 \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1315228396
 rsc_defaults $id=rsc-options \
 resource-stickiness=100
 
 Regards,
 
 James
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems






Re: [Linux-HA] [DRBD-user] Ensuring drbd is started before mounting filesystem

2011-10-23 Thread Andreas Kurz
On 10/23/2011 11:18 PM, Nick Khamis wrote:
 Hello Everyone,
 
 I was wondering if it's possible to use the order directive to ensure that
 drbd is fully started before attempting to mount the filesystem? I tried the
 following:
 
 node mydrbd1 \
attributes standby=off
 node mydrbd2 \
attributes standby=off
 primitive myIP ocf:heartbeat:IPaddr2 \
   op monitor interval=60 timeout=20 \
 params ip=192.168.2.5 cidr_netmask=24 \
 nic=eth1 broadcast=192.168.2.255 \
   lvs_support=true
 primitive myDRBD ocf:linbit:drbd \
   params drbd_resource=r0.res \
   op monitor role=Master interval=10 \
   op monitor role=Slave interval=30
 ms msMyDRBD myDRBD \
   meta master-max=1 master-node-max=1 \
   clone-max=2 clone-node-max=1 \
   notify=true globally-unique=false
 primitive myFilesystem ocf:heartbeat:Filesystem \
   params device=/dev/drbd0 directory=/service fstype=ext3 \
 op monitor interval=15 timeout=60 \
 meta target-role=Started
 group MyServices myIP myFilesystem meta target-role=Started
 order drbdAfterIP \
   inf: myIP msMyDRBD
 order filesystemAfterDRBD \
   inf: msMyDRBD:promote myFilesystem:start

There is no colocation between the DRBD master and the filesystem 
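Something like this should be the missing piece for the names used above (a sketch):

    colocation fs_on_drbd inf: MyServices msMyDRBD:Master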

 location prefer-mysql1 MyServices inf: mydrbd1
 location prefer-mysql2 MyServices inf: mydrbd2

? ... these constraints make no sense ...

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now





Re: [Linux-HA] [DRBD-user] Ensuring drbd is started before mounting filesystem

2011-10-23 Thread Andreas Kurz
Sorry for the noise ... address cut  paste error

Regards,
Andreas

On 10/24/2011 12:12 AM, Andreas Kurz wrote:
 On 10/23/2011 11:18 PM, Nick Khamis wrote:
 Hello Everyone,

 I was wondering if it's possible to use the order directive to ensure that
 drbd is fully started before attempting to mount the filesystem? I tried the
 following:

 node mydrbd1 \
attributes standby=off
 node mydrbd2 \
attributes standby=off
 primitive myIP ocf:heartbeat:IPaddr2 \
  op monitor interval=60 timeout=20 \
 params ip=192.168.2.5 cidr_netmask=24 \
 nic=eth1 broadcast=192.168.2.255 \
  lvs_support=true
 primitive myDRBD ocf:linbit:drbd \
  params drbd_resource=r0.res \
  op monitor role=Master interval=10 \
  op monitor role=Slave interval=30
 ms msMyDRBD myDRBD \
  meta master-max=1 master-node-max=1 \
  clone-max=2 clone-node-max=1 \
  notify=true globally-unique=false
 primitive myFilesystem ocf:heartbeat:Filesystem \
  params device=/dev/drbd0 directory=/service fstype=ext3 \
 op monitor interval=15 timeout=60 \
 meta target-role=Started
 group MyServices myIP myFilesystem meta target-role=Started
 order drbdAfterIP \
  inf: myIP msMyDRBD
 order filesystemAfterDRBD \
  inf: msMyDRBD:promote myFilesystem:start
 
 There is no colocation between the DRBD master and the filesystem 
 
 location prefer-mysql1 MyServices inf: mydrbd1
 location prefer-mysql2 MyServices inf: mydrbd2
 
 ? ... these constraints make no sense ...
 
 Regards,
 Andreas
 
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems






Re: [Linux-HA] [DRBD-user] Questions Regarding Configuration

2011-10-23 Thread Andreas Kurz
On 10/23/2011 09:39 PM, Nick Khamis wrote:
 The following works as expected:
 
 node mydrbd1 \
attributes standby=off
 node mydrbd2 \
attributes standby=off
 primitive myIP ocf:heartbeat:IPaddr2 \
   op monitor interval=60 timeout=20 \
 params ip=192.168.2.5 cidr_netmask=24 \
 nic=eth1 broadcast=192.168.2.255 \
   lvs_support=true
 primitive myDRBD ocf:linbit:drbd \
   params drbd_resource=r0.res \
   op monitor role=Master interval=10 \
   op monitor role=Slave interval=30
 ms msMyDRBD myDRBD \
   meta master-max=1 master-node-max=1 \
   clone-max=2 clone-node-max=1 \
   notify=true globally-unique=false
 group MyServices myIP
 order drbdAfterIP \
   inf: myIP msMyDRBD
 location prefer-mysql1 MyServices inf: mydrbd1
 location prefer-mysql2 MyServices inf: mydrbd2

??

 property $id=cib-bootstrap-options \
 no-quorum-policy=ignore \
 stonith-enabled=false \
 expected-quorum-votes=5 \
 dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \
 cluster-recheck-interval=0 \
 cluster-infrastructure=openais
   rsc_defaults $id=rsc-options \
   resource-stickiness=100
 
 However, when modifying the order entry to:
 
 order drbdAfterIP \
   inf: myIP:promote msMyDRBD:start
 
 DRBD no longer works. And when adding the following colocation:

yes, the promote of the IP will never happen as it is a) only configured
as a primitive and b) IPaddr2 does not support a promote action ... no IP
promote, no DRBD start ...
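A working version of that order for the names above could look like this (a sketch; the IP only has start/stop actions, the promote belongs to the DRBD master/slave resource):

    order drbdAfterIP inf: myIP:start msMyDRBD:promote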

 
 colocation drbdOnIP \
   inf: MyServices msMyDRBD:Master
 
 none of the resources work.

Have you tried removing those two obscure location constraints?

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now





Re: [Linux-HA] Inconsistencies between LRMD and CRM + Problem with MySQL OCF

2011-10-22 Thread Andreas Kurz
Hello,

On 10/22/2011 04:25 AM, Nick Khamis wrote:
 Hello Everyone,
 
 I have been strugglling with the MySQL OCF. On an unrelated, eyeing
 the logs I saw that some of the resources are shown as not running
 however.
 according to CRM, and checking them manually, resources such as myIP
 and myFilesystem are running:
 
 Online: [ mydrbd1 mydrbd2 ]
 OFFLINE: [ lb2 lb1 astdrbd1 astdrbd2 ]
 
  Resource Group: MyServices
  myIP   (ocf::heartbeat:IPaddr2):   Started mydrbd2
  myFilesystem   (ocf::heartbeat:Filesystem):Started mydrbd2
  Master/Slave Set: msMyDRBD [myDRBD]
  Masters: [ mydrbd2 ]
  Slaves: [ mydrbd1 ]
 
 Failed actions:
 mysql_monitor_0 (node=mydrbd1, call=2, rc=5, status=complete): not 
 installed
 mysql_monitor_0 (node=mydrbd2, call=2, rc=5, status=complete): not 
 installed

So mysql is not installed on nodes mydrbd1/2 ... probing them leads to
the above error ... all is fine if mysql should never run there and is
therefore not installed.

You won't get rid of these errors unless you install mysql there (or
some form of dummy).

 
 Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing
 key=9:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=mysql_monitor_0 )
 Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:mysql:2: probe
 Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing
 key=10:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=myIP_monitor_0 )
 Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:myIP:3: probe
 Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing
 key=11:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0
 op=myFilesystem_monitor_0 )
 Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:myFilesystem:4: probe
 Oct 21 21:58:09 mydrbd2 crmd: [22525]: info: do_lrm_rsc_op: Performing
 key=12:0:7:1fa0a769-05a7-4891-ac9c-dafacee2e0f0 op=myDRBD:0_monitor_0
 )
 Oct 21 21:58:09 mydrbd2 lrmd: [22522]: info: rsc:myDRBD:0:5: probe
 Oct 21 21:58:10 mydrbd2 crmd: [22525]: info: process_lrm_event: LRM
 operation mysql_monitor_0 (call=2, rc=5, cib-update=9, confirmed=true)
 not installed
 Oct 21 21:58:10 mydrbd2 crmd: [22525]: info: process_lrm_event: LRM
 operation myIP_monitor_0 (call=3, rc=7, cib-update=10, confirmed=true)
 not running
 Oct 21 21:58:11 mydrbd2 crmd: [22525]: info: process_lrm_event: LRM
 operation myFilesystem_monitor_0 (call=4, rc=7, cib-update=11,
 confirmed=true) not running

These are all expected results of the initial probing (monitor_0) of all
resources before the cluster starts with resource allocation ... rc=7
if a resource is not running, rc=5 if the necessary binaries/configs are
not installed.

All informational stuff; the cluster needs to know whether any resource is
already started outside of the cluster ... no need to worry ;-)

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 That is all I could find in regards to the MySQL related errors. Using
 the latest version of pcmk build from source. Also tried the newest
 version of MySQL OCF
 downloaded from GIT.
 
 Last updated: Fri Oct 21 22:13:08 2011
 Last change: Fri Oct 21 21:54:46 2011 via cibadmin on mydrbd1
 Stack: openais
 Current DC: mydrbd1 - partition WITHOUT quorum
 Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
 6 Nodes configured, 5 expected votes
 5 Resources configured.
 
 
 Thanks in Advance,
 
 Nick.
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems







Re: [Linux-HA] Basic Question about LVM

2011-10-19 Thread Andreas Kurz
Hello,

On 10/18/2011 11:59 PM, Bob Schatz wrote:
 I am trying to setup a LVM fail over cluster and I must be missing something 
 basic. :(
 
 The configuration I want is:
 
 IP address
  |
 File System
  |
  LVM
 
 I am not using DRBD.

So you are using a shared storage device like a FC disk?

 
 I am running this on Ubuntu 10.04LS.
 
 Everything works fine and I can migrate the group between two nodes.   
 However, if I reboot one node OR STONITH one node it causes the other node to 
 stop and then restart all resources.
 
 The problem is that when the node reboots, it activates the LVM volume group 
 and then Pacemaker says native_add_running: Resource ocf::LVM:volume-lvm-p 
 appears to be active on 2 nodes.  This causes the group to stop and then be 
 restarted.
 
 I tried to play with /etc/lvm/lvm.conf filtering but that just prevented the 
 disks from being read even by the agent.

Adopted volume_list Parameter to not activate all vgs? After doing
changes to lvm.conf you also need to recreate initramfs.

Safest and recommended setup is to use clvmd with dlm and update your vg
to a clustered one (also adopt the locking type in lvm.conf).

Last step is to integrate dlm/clvmd/lvm in your Pacemaker setup and you
are ready to go.
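A sketch of the volume_list approach (the VG name is an example; on Ubuntu rebuild the initramfs afterwards):

    # /etc/lvm/lvm.conf
    activation {
        volume_list = [ "rootvg" ]   # only VGs listed here are auto-activated at boot
    }

    # then: update-initramfs -u   (and reboot to verify)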

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 
 What am I missing?
 
 Thanks,
 
 Bob
 
 
 My configuration is:
 
 node cc-vol-6-1
 node cc-vol-6-2
 primitive ipmilan-cc-vol-6-1 stonith:external/ipmi \
 params hostname=cc-vol-6-1 ipaddr=XXX userid=XXX 
 passwd=XXX \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=60 \
 op monitor interval=60 timeout=60 start-delay=0
 primitive ipmilan-cc-vol-6-2 stonith:external/ipmi \
 params hostname=cc-vol-6-2 ipaddr=XXX userid=XX 
 passwd=X! \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=60 \
 op monitor interval=60 timeout=60 start-delay=0
 primitive volume-fs-p ocf:heartbeat:Filesystem \
 params device=/dev/nova-volumes/nova-volumes-vol 
 directory=/volume-mount fstype=xfs \
 op start interval=0 timeout=60 \
 op monitor interval=60 timeout=60 OCF_CHECK_LEVEL=20 \
 op stop interval=0 timeout=120
 primitive volume-iscsit-ip-p ocf:heartbeat:IPaddr2 \
 params ip=YY nic=ZZ \
 op monitor interval=5s
 primitive volume-lvm-p ocf:heartbeat:LVM \
 params volgrpname=nova-volumes exclusive=true \
 op start interval=0 timeout=30 \
 op stop interval=0 timeout=30
 primitive volume-vol-ip-p ocf:heartbeat:IPaddr2 \
 params ip=X1x1x1x1x nic=y1y1y1 \
 op monitor interval=5s
 group volume-fs-ip-iscsi-g volume-lvm-p volume-fs-p volume-iscsit-ip-p 
 volume-vol-ip-p \
 meta target-role=Started
 location loc-ipmilan-cc-vol-6-1 ipmilan-cc-vol-6-1 -inf: cc-vol-6-1
 location loc-ipmilan-cc-vol-6-2 ipmilan-cc-vol-6-2 -inf: cc-vol-6-2
 property $id=cib-bootstrap-options \
 dc-version=1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 no-quorum-policy=ignore \
 stonith-enabled=true
 rsc_defaults $id=rsc-options \
 resource-stickiness=1000
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems





Re: [Linux-HA] violate uniqueness for parameter drbd_resource

2011-10-19 Thread Andreas Kurz
Hello,

On 10/19/2011 04:49 PM, Nick Khamis wrote:
 Hello Everyone,
 
 What we have is a 4 node cluster: 2 Running mysql on a active/passive,
 and 2 running our application on an active/active:
 
 MyDRBD1 and MyDRBD2: Mysql, DRBD (active/passive)
 ASTDRBD1 and ASTDRBD2: In-house application, DRBD dual primary
 A snippet of our config looks like this:
 
 node mydrbd1 \
attributes standby=off
 node mydrbd2 \
attributes standby=off
 node astdrbd1 \
attributes standby=off
 node astdrbd2 \
attributes standby=off
 primitive drbd_mysql ocf:linbit:drbd \
   params drbd_resource=r0.res \
   op monitor role=Master interval=10 \
   op monitor role=Slave interval=30
 .
 primitive drbd_asterisk ocf:linbit:drbd \
   params drbd_resource=r0.res \
   op monitor interval=20 timeout=20 role=Master \
   op monitor interval=30 timeout=20 role=Slave
 ms ms_drbd_asterisk drbd_asterisk \
   meta master-max=2 notify=true \
   interleave=true
 group MyServices myIP fs_mysql mysql \
   meta target-role=Started
 group ASTServices astIP asteriskDLM asteriskO2CB fs_asterisk \
   meta target-role=Started
 .
 
 I am recieving the following warning: WARNING: Resources
 drbd_asterisk,drbd_mysql violate uniqueness for parameter
 drbd_resource: r0.res
 Now the obvious thing to do is to change the resource name at the DRBD
 level however, I assumed that the parameter uniqueness was bound to
 the primitive?

Only one resource per cluster should use this value for this attribute
if it is marked globally-unique in the RA meta-information.

Do yourself a favour and give the DRBD resources meaningful names ... how
about asterisk and mysql ;-)
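A sketch of how that could look (the DRBD resource names mysql and asterisk are examples and must of course exist in the DRBD configuration):

    primitive drbd_mysql ocf:linbit:drbd \
            params drbd_resource=mysql \
            op monitor role=Master interval=10 \
            op monitor role=Slave interval=30
    primitive drbd_asterisk ocf:linbit:drbd \
            params drbd_resource=asterisk \
            op monitor role=Master interval=20 \
            op monitor role=Slave interval=30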

 
 My second quick question is, I like to use group + location to
 single out services on specific nodes however, when creating clones:
 
 clone cloneDLM asteriskDLM meta globally-unique=false interleave=true
 
 I am recieving ERROR: asteriskDLM already in use at ASTServices
 error? My question is, what are the benefits of using group + location
 vs. clone + location?

Once a resource is in a group it cannot be used for clones/MS any more
... though you can clone a group or make it MS.

 With the latter I assue we will have a long list of location (one for
 each primitive + node)? And with the former we do not have he meta
 information
 (globally-unique, and interleave)?

I assume you want to manage a cluster filesystem ... so put all the
dlm/o2cb/cluster-fs resources in a group and clone it (and use
interleave for this clone)
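A sketch with the resource names from your config (they would have to be taken out of the ASTServices group first):

    group g_ocfs2 asteriskDLM asteriskO2CB fs_asterisk
    clone cl_ocfs2 g_ocfs2 meta interleave=true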

Regards,
Andreas

-- 
Need help with Pacemaker or DRBD?
http://www.hastexo.com/now





Re: [Linux-HA] IPaddr / ifconfig deprecated

2011-10-19 Thread Andreas Kurz
Hello,

On 10/19/2011 04:35 PM, alain.mou...@bull.net wrote:
 Hi 
 
 Florian, just for information, following my remark last week on mysql 
 option -O deprecated, I also noticed in the script IPaddr the use of 
 ifconfig 

Please use the IPaddr2 RA for current setups ... IPaddr is only here for
backwards compatibility and for platforms without the ip utility.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

 command which is flagged as deprecated (at least on RH) and this generates 
 lots of  useless syslog messages.
 
 So I replace all the $IFCONFIG functions in IPaddr script (but only for 
 SYSTYPE=Linux as I'm on RHEL6) , and this seems to work fine.
 
 function delete_interface :
  CMD=ip addr del $ipaddr dev $ifname;;
 
 function find_generic_interface :
 ifname=`ip addr | grep $ipaddr | awk '{print $NF}'`
 case $ifname in
 *:*)  echo $ifname; return $OCF_SUCCESS  ;;
 *)  return $OCF_ERR_GENERIC;;
 esac
 
 function find_free_interface :
IFLIST=`ip addr | grep eth1:[0-9] | awk  '{print $NF}'`
 
 function add_interface : 
 CMD=`ip addr add $ipaddr/$CidrNetmask broadcast $broadcast 
 dev $iface_base label $iface`;;
 providing the CidrNetmask is retrieved in hex format from the iface_base 
 line, or given in hex format in the pacemaker primitive
 
 Just a few suggestions which seems to work ...
 Alain
 
 
 
 
 
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems







Re: [Linux-HA] Problem with corosync and drbd 8.4.0

2011-10-18 Thread Andreas Kurz
On 10/18/2011 09:44 AM, SINN Andreas wrote:
 Hello!
 
 Thanks.
 
 Now it works on one node, but on the other, I get the following error in the 
 messages:
 
 ERROR: Couldn't find device [/dev/drbd0]. Expected /dev/??? to exist
 
 But the device exists:
 
 ls -la /dev/drbd0
 brw-rw 1 root disk 147, 0 Oct 18 09:37 /dev/drbd0
 
 and when I start only the drbd, it works fine.

Sorry, not enough information ... it is hard to comment on one single log
line; the full log is needed.

What is the drbd status on both nodes when this error occurs ... cat
/proc/drbd?

Regards,
Andreas

--
Need help with Pacemaker?
http://www.hastexo.com/now

 
 Please help.
 
 Thanks
 
 Andreas
 
 -Ursprüngliche Nachricht-
 Von: linux-ha-boun...@lists.linux-ha.org 
 [mailto:linux-ha-boun...@lists.linux-ha.org] Im Auftrag von Andreas Kurz
 Gesendet: Montag, 17. Oktober 2011 15:22
 An: linux-ha@lists.linux-ha.org
 Betreff: Re: [Linux-HA] Problem with corosync and drbd 8.4.0
 
 Hello,
 
 On 10/17/2011 02:53 PM, SINN Andreas wrote:
 Hello!

 I have installed drbd 8.4.0 on RHEL 6 and want to build a cluster with 
 corosync. The drbd runs without any problem.

 When I configure with crm and want to start, I get the following error in 
 the crm_mon:

 Failed actions:
 data_monitor_0 (node=cl-sftp-server1, call=2, rc=6, status=complete): 
 not configured
 data_monitor_0 (node=cl-sftp-server2, call=2, rc=6, 
 status=complete): not configured

 When I do a crm_verify -LV , I get the following errors:

 crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error 
 - data_monitor_0 failed with rc=6: Preventing data from re-starting 
 anywhere in the cluster
 crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error 
 - data_monitor_0 failed with rc=6: Preventing data from re-starting 
 anywhere in the cluster

 cat /etc/drbd.conf
 global {
   usage-count yes;
 }
 common {
   net {
 protocol C;
   }
 }
 resource data {
   meta-disk internal;
   device/dev/drbd0;
   syncer {
 verify-alg sha1;
   }
   net {
 allow-two-primaries;
 
 ^
 Don't enable this if you don't know what you are doing ...
 
   }
   on cl-sftp-server1 {
 disk   /dev/sda3;
 address10.100.49.101:7790;
   }

   on cl-sftp-server2 {
 disk   /dev/sda3;
 address10.100.49.102:7790;
   }

 crm configure show:
 node cl-sftp-server1
 node cl-sftp-server2
 primitive data ocf:linbit:drbd \
 params drbd_resource=data \
 op monitor interval=60s \
 op monitor interval=10 role=Master \
 op monitor interval=30 role=Slave
 property $id=cib-bootstrap-options \
 dc-version=1.0.11-a15ead49e20f047e129882619ed075a65c1ebdfe \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 default-action-timeout=240 \
 stonith-enabled=false

 Can you someone help? What is the failure?
 
 You need to configure a Master/Slave resource, not only the
 primitive ... e.g.:
 
 master ms_data data meta notify=true
 
 should solve your problem ... and reading through your logs should also 
 reveal this ;-)
 
 Regards,
 Andreas
 
 --
 Need help with Pacemaker or DRBD?
 http://www.hastexo.com/now
 
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems
 
 




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Problem with corosync and drbd 8.4.0

2011-10-17 Thread Andreas Kurz
Hello,

On 10/17/2011 02:53 PM, SINN Andreas wrote:
 Hello!
 
 I have installed drbd 8.4.0 on RHEL 6 and want to build a cluster with 
 corosync. The drbd runs without any problem.
 
 When I configure with crm and want to start, I get the following error in the 
 crm_mon:
 
 Failed actions:
 data_monitor_0 (node=cl-sftp-server1, call=2, rc=6, status=complete): not 
 configured
 data_monitor_0 (node=cl-sftp-server2, call=2, rc=6, status=complete): not 
 configured
 
 When I run crm_verify -LV, I get the following errors:
 
 crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error - 
 data_monitor_0 failed with rc=6: Preventing data from re-starting anywhere in 
 the cluster
 crm_verify[6097]: 2011/10/17_14:51:46 ERROR: unpack_rsc_op: Hard error - 
 data_monitor_0 failed with rc=6: Preventing data from re-starting anywhere in 
 the cluster
 
 cat /etc/drbd.conf
 global {
   usage-count yes;
 }
 common {
   net {
 protocol C;
   }
 }
 resource data {
   meta-disk internal;
   device    /dev/drbd0;
   syncer {
 verify-alg sha1;
   }
   net {
 allow-two-primaries;

^
Don't enable this if you don't know what you are doing ...

   }
   on cl-sftp-server1 {
 disk   /dev/sda3;
  address   10.100.49.101:7790;
   }
 
   on cl-sftp-server2 {
 disk   /dev/sda3;
  address   10.100.49.102:7790;
   }
 }
 
 crm configure show:
 node cl-sftp-server1
 node cl-sftp-server2
 primitive data ocf:linbit:drbd \
 params drbd_resource=data \
 op monitor interval=60s \
 op monitor interval=10 role=Master \
 op monitor interval=30 role=Slave
 property $id=cib-bootstrap-options \
 dc-version=1.0.11-a15ead49e20f047e129882619ed075a65c1ebdfe \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 default-action-timeout=240 \
 stonith-enabled=false
 
 Can someone help? What is the problem?

You need to configure a Master/Slave resource, not only the
primitive ... e.g.:

master ms_data data meta notify=true

should solve your problem ... and reading through your logs should also
reveal this ;-)
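
For completeness, a sketch of what the master/slave wrapper and the usual
constraints for a filesystem on top of DRBD could look like in the crm shell.
The meta attribute values are the common two-node defaults, and fs_data is a
hypothetical Filesystem primitive that is not part of the original post:

ms ms_data data \
        meta master-max="1" master-node-max="1" \
        clone-max="2" clone-node-max="1" notify="true"
# only needed once a filesystem (or other service) runs on top of DRBD:
colocation fs_on_drbd inf: fs_data ms_data:Master
order fs_after_drbd inf: ms_data:promote fs_data:start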

Regards,
Andreas

-- 
Need help with Pacemaker or DRBD?
http://www.hastexo.com/now

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Ordered resources

2011-05-13 Thread Andreas Kurz
On 2011-05-13 08:55, Maxim Ianoglo wrote:
 Hello,
 
 I have the following configuration:
 Nodes: Node_A and Node_B
 
 Resources: WWW ( gr_apache_www ), NFS Server ( gr_storage_server ), NFS 
 Client ( gr_storage_client )
 
 Locations:
 gr_apache_www: by default on Node_A, fails over to Node_B
 gr_storage_server: by default on Node_A, fails over to Node_B
 gr_storage_client: by default on Node_A; it only fails over for the case where
 Node_A is brought back online while gr_storage_server is, for now, not moved
 back to its default location, but gr_apache_www is.
 
 Constraints:
 colocation colo_storage -inf: gr_storage_client gr_storage_server
 order ord_storage inf: gr_storage_server gr_storage_client
 order ord_www inf: gr_storage_server ( gr_apache_www_main )
 order ord_www2 inf: gr_storage_client ( gr_apache_www_main )
 
 Now I have the following situation:
 I put Node_A in standby, so ALL resources should go to Node_B (except for
 gr_storage_client), but for some reason only gr_storage_server is moved to
 Node_B.
 gr_apache_www is not even started.
 
 How can I make gr_apache_www start even if gr_storage_client is not running
 anywhere? But if gr_storage_client is running somewhere, gr_apache_www should
 start after it.
 

order ord_www2 0: gr_storage_client gr_apache_www_main

... an advisory order constraint.
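
With a score of 0 the ordering is only honoured when both resources are being
started anyway, so gr_apache_www is no longer blocked when gr_storage_client
cannot run. A minimal sketch of the two order constraints, keeping the
resource set from the posted configuration (only ord_www2 changes):

# unchanged, mandatory: the web groups start only after gr_storage_server
order ord_www inf: gr_storage_server ( gr_nginx_static gr_apache_www_main )
# now advisory: ordering after gr_storage_client is preferred, not required
order ord_www2 0: gr_storage_client ( gr_nginx_static gr_apache_www_main )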

Regards,
Andreas


 Configuration ( constraints and cluster options ):
 
 location loc_gr_apache_www_default gr_apache_www \
 rule $id=prefered_loc_gr_apache_default 100: #uname eq Node_A
 location loc_gr_apache_www_failover gr_apache_www \
 rule $id=prefered_loc_gr_apache_failover 50: #uname eq Node_B
 location loc_gr_storage_server_default gr_storage_server \
 rule $id=prefered_loc_gr_storage_server_default 100: #uname eq 
 Node_A
 location loc_gr_storage_server_failover gr_storage_server \
 rule $id=prefered_loc_gr_storage_server_failover 50: #uname eq 
 Node_B
 colocation colo_storage -inf: gr_storage_client gr_storage_server
 order ord_nfslock_storage_client inf: gr_storage_client clone_nfslock
 order ord_nfslock_storage_server inf: gr_storage_server clone_nfslock
 order ord_storage inf: gr_storage_server gr_storage_client
 order ord_www inf: gr_storage_server ( gr_nginx_static gr_apache_www_main )
 order ord_www2 inf: gr_storage_client ( gr_nginx_static gr_apache_www_main )
 property $id=cib-bootstrap-options \
 symmetric-cluster=true \
 no-quorum-policy=ignore \
 stonith-enabled=false \
 stonith-action=reboot \
 startup-fencing=true \
 stop-orphan-resources=true \
 stop-orphan-actions=true \
 remove-after-stop=false \
 default-action-timeout=60s \
 is-managed-default=true \
 cluster-delay=60s \
 pe-error-series-max=-1 \
 pe-warn-series-max=-1 \
 pe-input-series-max=-1 \
 dc-version=1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87 \
 last-lrm-refresh=1305263051 \
 cluster-infrastructure=Heartbeat
 rsc_defaults $id=rsc-options \
 resource-stickiness=100
 ===
 
 Thank you.
 --
 Maxim Ianoglo
 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems




___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] CIB process quits and could not connect to CRM

2011-03-10 Thread Andreas Kurz
On 03/10/2011 06:30 AM, Tiruvenkatasamy Baskaran wrote:
 Hi,
 I have installed heartbeat-3.0.3-2.el5.x86_64.rpm and 
 pacemaker-1.1.2-7.el6.x86_64.rpm on RHEL.

I'm quite sure RHEL's version is compiled without Heartbeat support ...
try corosync.
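
For reference, a minimal corosync 1.x configuration (/etc/corosync/corosync.conf)
with the Pacemaker plugin, as commonly used on RHEL 6 at that time; the
bindnetaddr and multicast values are placeholders and must match your network:

compatibility: whitetank

totem {
        version: 2
        secauth: off
        interface {
                ringnumber: 0
                # network address (not host address) of the cluster interface
                bindnetaddr: 192.168.122.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

logging {
        to_syslog: yes
}

service {
        # ver: 0 lets corosync spawn the Pacemaker daemons itself
        name: pacemaker
        ver: 0
}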

Regards,
Andreas

 I have configured ha.cf, authkeys and cib.xml as follows.
 When I start heartbeat, it will in turn start the crm, as crm is configured
 in the ha.cf file.
 The ha-log file is attached to this mail.
 
 I could not connect to CRM
 [root@pcmk-1 crm]# crm configure show
 Signon to CIB failed: connection failed
 Init failed, could not perform requested operations
 ERROR: cannot parse xml: no element found: line 1, column 0
 ERROR: No CIB!
 
 If I look into the log, the following message is found:
 
 Mar 09 17:49:25 pcmk-2 stonith-ng: [5328]: CRIT: get_cluster_type: This 
 installation of Pacemaker does not support the '(null)' cluster 
 infrastructure.  Terminating.
 
 It was starting the cib process:
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/cib (495,489)
 But after some time the cib process quits:
   Mar 09 17:49:26 pcmk-2 heartbeat: [5313]: WARN: Managed 
 /usr/lib64/heartbeat/cib process 5326 exited with return code 100.
 
 Can anyone tell me why the cib process quits and why "This installation of
 Pacemaker does not support the '(null)' cluster infrastructure.  Terminating."
 is being displayed?
 
 For more info, look into the ha.log file.
 ha.cf
 logfile /var/log/ha-log
 logfacility local0
 keepalive 2
 deadtime 30
 initdead 120
 udpport 694
 bcast eth0   # Linux
 auto_failback on
 node  pcmk-1  pcmk-2
 crm respawn
 authkeys
 auth 2
 2 sha1 test-ha
 cib.xml
 <cib>
   <configuration>
     <crm_config>
       <cluster_property_set id="cib-bootstrap-options">
         <attributes/>
       </cluster_property_set>
     </crm_config>
     <nodes>
       <node uname="pcmk-1" type="normal"
             id="f11899c3-ed6e-4e63-abae-b9af90c62283"/>
       <node uname="pcmk-2" type="normal"
             id="663bae4d-44a0-407f-ac14-389150407159"/>
     </nodes>
     <resources/>
     <constraints/>
   </configuration>
 </cib>
 ha-log
 
 Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: info: Version 2 support: respawn
 Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: WARN: File /etc/ha.d//haresources 
 exists.
 Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: WARN: This file is not used because 
 crm is enabled
 Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: WARN: Logging daemon is disabled 
 --enabling logging daemon is recommended
 Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: info: **
 Mar 09 17:47:25 pcmk-2 heartbeat: [5311]: info: Configuration validated. 
 Starting heartbeat 3.0.2
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: heartbeat: version 3.0.2
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: Heartbeat generation: 
 1299567998
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: glib: UDP Broadcast heartbeat 
 started on port 694 (694) interface eth0
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: glib: UDP Broadcast heartbeat 
 closed on port 694 interface eth0 - Status: 1
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: G_main_add_TriggerHandler: 
 Added signal manual handler
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: G_main_add_TriggerHandler: 
 Added signal manual handler
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: G_main_add_SignalHandler: 
 Added signal handler for signal 17
 Mar 09 17:47:25 pcmk-2 heartbeat: [5313]: info: Local status now set to: 'up'
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: WARN: node pcmk-1: is dead
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Comm_now_up(): updating 
 status to active
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Local status now set to: 
 'active'
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/ccm (495,489)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/cib (495,489)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5325]: info: Starting 
 /usr/lib64/heartbeat/ccm as uid 495  gid 489 (pid 5325)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5326]: info: Starting 
 /usr/lib64/heartbeat/cib as uid 495  gid 489 (pid 5326)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/lrmd -r (0,0)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5327]: info: Starting 
 /usr/lib64/heartbeat/lrmd -r as uid 0  gid 0 (pid 5327)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/stonithd (0,0)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5328]: info: Starting 
 /usr/lib64/heartbeat/stonithd as uid 0  gid 0 (pid 5328)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/attrd (495,489)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5329]: info: Starting 
 /usr/lib64/heartbeat/attrd as uid 495  gid 489 (pid 5329)
 Mar 09 17:49:25 pcmk-2 heartbeat: [5313]: info: Starting child client 
 /usr/lib64/heartbeat/crmd (495,489)
 Mar 09 
