Re: [Linux-HA] [Pacemaker] [Cluster-devel] [ha-wg] [RFC] Organizing HA Summit 2015
- Original Message - On 25 Nov 2014, at 8:54 pm, Lars Marowsky-Bree l...@suse.com wrote: On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote: Yeah, well, devconf.cz is not such an interesting event for those who do not wear the fedora ;-) That would be the perfect opportunity for you to convert users to Suse ;) I´d prefer, at least for this round, to keep dates/location and explore the option to allow people to join remotely. Afterall there are tons of tools between google hangouts and others that would allow that. That is, in my experience, the absolute worst. It creates second class participants and is a PITA for everyone. I agree, it is still a way for people to join in tho. I personally disagree. In my experience, one either does a face-to-face meeting, or a virtual one that puts everyone on the same footing. Mixing both works really badly unless the team already knows each other. I know that an in-person meeting is useful, but we have a large team in Beijing, the US, Tasmania (OK, one crazy guy), various countries in Europe etc. Yes same here. No difference.. we have one crazy guy in Australia.. Yeah, but you're already bringing him for your personal conference. That's a bit different. ;-) OK, let's switch tracks a bit. What *topics* do we actually have? Can we fill two days? Where would we want to collect them? Personally I'm interested in talking about scaling - with pacemaker-remoted and/or a new messaging/membership layer. If we're going to talk about scaling, we should throw in our new docker support in the same discussion. Docker lends itself well to the pet vs cattle analogy. I see management of docker with pacemaker making quite a bit of sense now that we have the ability to scale into the cattle territory. Other design-y topics: - SBD - degraded mode - improved notifications - containerisation of services (cgroups, docker, virt) - resource-agents (upstream releases, handling of pull requests, testing) Yep, We definitely need to talk about the resource-agents. User-facing topics could include recent features (ie. pacemaker-remoted, crm_resource --restart) and common deployment scenarios (eg. NFS) that people get wrong. Adding to the list, it would be a good idea to talk about Deployment integration testing, what's going on with the phd project and why it's important regardless if you're interested in what the project functionally does. -- Vossel ___ Pacemaker mailing list: pacema...@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] RHEL Server 6.6 HA Configuration
- Original Message - I was trying to install Corosync and Cman using yum install -y pacemaker cman pcs ccs resource-agents This works fine on Centos 6.3. Tried the same on Redhat Redhat Enterprise Linux Server 6.6 and ran into issues. It gives error like Loaded plugins: product-id, refresh-packagekit, rhnplugin, security, subscription-manager There was an error communicating with RHN. RHN Satellite or RHN Classic support will be disabled. Error Message: Please run rhn_register as root on this client Error Class Code: 9 Error Class Info: Invalid System Credentials. Explanation: An error has occurred while processing your request. If this problem persists please enter a bug report at bugzilla.redhat.com. If you choose to submit the bug report, please be sure to include details of what you were trying to do when this error occurred and details on how to reproduce this problem. Setting up Install Process No package pacemaker available. No package cman available. No package pcs available. No package ccs available. Nothing to do centos.repo is as follows... vim /etc/yum.repos.d/centos.repo is as below [centos-6-base] name=CentOS-$releasever - Base mirrorlist=http://mirrorlist.centos.org/?release=$releaseverarch=$basearchrepo=os enabled=0 #baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/ Realized that this Redhat version does not have the High availability addon package. This package needs to be bought OR the version needs to be upgraded to 7. I got information like Pacemaker has been available as part of RHEL, since 6.0 as part of the High Availability (HA) add-on. Question: 1.Is the above understanding correct? yes 2.Are there significant differences in the manner Corosync and CMan are configured on Enterprise server Vs Centos? Pacemaker in Centos 6.6 and RHEL 6.6 should be configured the same way. -- Vossel Thank You, Ranjan -- View this message in context: http://linux-ha.996297.n3.nabble.com/RHEL-Server-6-6-HA-Configuration-tp15945.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Heartbeat in Amazon VMs doest not create virtaul ip address
- Original Message - Hi, I installed on HeartBeat,Centos 6.5 on 2 Amazon EC2 machinesthis is the If you have an option, I'd strongly recommend using the Pacemaker+CMAN stack in rhel 6.5. Red Hat began supporting pacemaker in 6.5, so it should be available to you. -- Vossel version: [root@ip-10-0-2-68 ha.d]# rpm -qa | grep heartbeat heartbeat-libs-3.0.4-2.el6.x86_64 heartbeat-3.0.4-2.el6.x86_64 heartbeat-devel-3.0.4-2.el6.x86_64 the floating IP is [root@ip-10-0-2-68 ha.d]# cat haresources ip-10-0-2-68 10.0.2.70 but it is not created on any machine, it does not matter where I do the takeover or standby commands what am I missing? is this even possible ? these are my setting in ha.cf logfacility local0 ucast eth0 10.0.2.69 auto_failback on node ip-10-0-2-68 ip-10-0-2-69 ping 10.0.2.1 use_logd yes logfacility local0 ucast eth0 10.0.2.68 auto_failback on node ip-10-0-2-68 ip-10-0-2-69 ping 10.0.2.1 use_logd yes these is the output of the route command [root@ip-10-0-2-68 ha.d]# route -n Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface 10.0.2.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0 0.0.0.0 10.0.2.1 0.0.0.0 UG 0 0 0 eth0 [root@ip-10-0-2-68 ha.d]# this is how the interfaces eth0 are set up on machine 1:[root@ip-10-0-2-68 ha.d]# ifconfig eth0 Link encap:Ethernet HWaddr 12:23:49:EF:3A:53 inet addr:10.0.2.68 Bcast:10.0.2.255 Mask:255.255.255.0 inet6 addr: fe80::1023:49ff:feef:3a53/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1 RX packets:269823 errors:0 dropped:0 overruns:0 frame:0 TX packets:192305 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:167802149 (160.0 MiB) TX bytes:48341828 (46.1 MiB) Interrupt:247 these are the logs showing everything going on fine but when doing ifconfig the interface is not there: Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: WARN: node ip-10-0-2-69: is dead Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: info: Comm_now_up(): updating status to active Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: info: Local status now set to: 'active' Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: WARN: No STONITH device configured. Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: WARN: Shared disks are not protected. Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: info: Resources being acquired from ip-10-0-2-69. Nov 11 21:37:39 ip-10-0-2-68 mach_down(default)[14769]: info: /usr/share/heartbeat/mach_down: nice_failback: foreign resources acquired Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: info: mach_down takeover complete. Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14681]: [14681]: info: Initial resource acquisition complete (mach_down) Nov 11 21:37:39 ip-10-0-2-68 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_10.0.2.70)[14845]: INFO: Resource is stopped Nov 11 21:37:39 ip-10-0-2-68 heartbeat[14701]: [14701]: info: Local Resource acquisition completed. Nov 11 21:37:40 ip-10-0-2-68 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_10.0.2.70)[14958]: INFO: Resource is stopped Nov 11 21:37:40 ip-10-0-2-68 IPaddr(IPaddr_10.0.2.70)[15057]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.0.2.70 eth0 10.0.2.70 auto not_used not_used Nov 11 21:37:40 ip-10-0-2-68 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_10.0.2.70)[15064]: INFO: Success Nov 11 21:37:49 ip-10-0-2-68 heartbeat[14681]: [14681]: info: Local Resource acquisition completed. 
(none) Nov 11 21:37:49 ip-10-0-2-68 heartbeat[14681]: [14681]: info: local resource transition completed. Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18350]: WARN: node ip-10-0-2-68: is dead Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18350]: info: Comm_now_up(): updating status to active Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18350]: info: Local status now set to: 'active' Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18350]: WARN: No STONITH device configured. Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18350]: WARN: Shared disks are not protected. Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18350]: info: Resources being acquired from ip-10-0-2-68. Nov 11 21:38:16 ip-10-0-2-69 heartbeat: [18360]: info: No local resources [/usr/share/heartbeat/ResourceManager listkeys ip-10-0-2-69] to acquire. Nov 11 21:38:17 ip-10-0-2-69 /usr/lib/ocf/resource.d//heartbeat/IPaddr(IPaddr_10.0.2.70)[18441]: INFO: Resource is stopped Nov 11 21:38:17 ip-10-0-2-69 IPaddr(IPaddr_10.0.2.70)[18537]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.0.2.70 eth0 10.0.2.70 auto not_used not_used Nov 11 21:38:17 ip-10-0-2-69
Re: [Linux-HA] MySQL resource agent and MySQL Master/Slave replication
- Original Message - Dear guys, I have set up a MySQL Master/Slave replication and tried to make it HA using corosync/pacemaker. I used the Mysql Resource Agent currently provided by Debian Wheezy and set up the corosync primitives as mentioned in the examples[1]. When migrating the mysql resource from one node to the other, the mysql resource agent didn't handle the slaves at all. As I inspected the code of the mysql resource agent, I figured out, that all the slave handling is done in the mysql_notify function of the resource agent. To enable the multi state resource ms_mysql to send notifications to the mysql resource agent, I added: ms ms_mysql p_mysql \ meta clone-max=3 \ *meta notify=true* Now the MySQL replication slaves get started/stopped as expected. You may add this parameter to your documentation[2], so other users don't have to figure it out themselves :-) good observation, thanks for the feedback! -- Vossel cheers Tom -- unixum Tom Hutter Hufnertwiete 3 D-22305 Hamburg Telefon: +49 40 67 39 21 52 mobil : +49 174 400 24 16 E-Mail: tom.hut...@unixum.de www.unixum.de Steuer-Nr.: 43/102/01125 Geschäftsführung: Tom Hutter [1] http://www.linux-ha.org/wiki/Mysql_%28resource_agent%29 [2] http://www.linux-ha.org/wiki/Mysql_%28resource_agent%29#MySQL_master.2Fslave_replication ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Remote node attributes support in crmsh
- Original Message - 22.10.2014 12:02, Dejan Muhamedagic wrote: On Mon, Oct 20, 2014 at 07:12:23PM +0300, Vladislav Bogdanov wrote: 20.10.2014 18:23, Dejan Muhamedagic wrote: Hi Vladislav, Hi Dejan! On Mon, Oct 20, 2014 at 09:03:40AM +0300, Vladislav Bogdanov wrote: Hi Kristoffer, do you plan to add support for recently added remote node attributes feature to chmsh? Currently (at least as of 2.1, and I do not see anything relevant in the git log) crmsh fails to update CIB if it contains node attributes for remote (bare-metal) node, complaining that duplicate element is found. No wonder :) The uname effectively dubs as an element id. But for bare-metal nodes it is natural to have ocf:pacemaker:remote resource with name equal to remote node uname (I doubt it can be configured differently). Is that required? Didn't look in code, but seems like yes, :remote resource name is the only place where pacemaker can obtain that node name. I find it surprising that the id is used to carry information. I'm not sure if we had a similar case (apart from attributes). If I comment check for 'obj_id in id_set', then it fails to update CIB because it inserts above primitive definition into the node section. Could you please show what would the CIB look like with such a remote resource (in crmsh notation). node 1: node01 node rnode001:remote \ attributes attr=value primitive rnode001 ocf:pacemaker:remote \ params server=192.168.168.20 \ op monitor interval=10 \ meta target-role=Started What do you expect to happen when you reference rnode001, in say: That is not me ;) I just want to be able to use crmsh to assign remote node operational and utilization (?) attributes and to work with it after that. Probably that is not yet set in stone, and David may change that allowing to f.e. new 'node_name' parameter to ocf:pacemaker:remote override remote node name guessed from the primitive name. David, could you comment please? why we would want to separate the remote-node from the resource's primative instance name? -- David Best, Vladislav crm configure show rnode001 I'm still trying to digest having hostname used to name some other element. Wonder what/where else will we have issues for this reason. Cheers, Dejan Best, Vladislav Given that nodes are for the most part referenced by uname (instead of by id), do you think that a configuration where a primitive element is named the same as a node, the user can handle that in an efficient manner? (NB: No experience here with ocf:pacemaker:remote :) Cheers, Dejan Best, Vladislav ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Virtual address for slave
- Original Message - Hello! I'd like to have two virtual adresses: vip-master and vip-slave. vip-master should be bound to master mode, vip-slave should be bound to slave node. How can I do it ? http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_multi_state_constraints Best regards Jarek ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Managed Failovers w/ NFS HA Cluster
- Original Message - I feel like this is something that must have been covered extensively already but I've done a lot of googling, looked at a lot of cluster configs, but have not found the solution. I have an HA NFS cluster (corosync+pacemaker). The relevant rpms are listed below but I'm not sure they are that important to the question which is this... When performing managed failovers of the NFS-exported file system resource from one node to the other (crm resource move), any active NFS clients experience an I/O error when the file system is unexported. In other words, you must unexport it to unmount it. As soon as it is unexported, clients are no longer able to write to it and experience an I/O error (rather than just blocking). In a failure scenario this is not a problem becuase the file system is never unexported on the primary server. Rather the server just goes down, the secondary takes over the resources and client I/O blocks until the process is complete and then goes about its business. We would like this same behavior for a *managed* failover but have not found a mount or export option/scenario that works. Is it possible? What am I missing? I realize this is more of an nfs/exportfs question but I would think that those implementing NFS HA clusters would be familiar with the scenario I'm describing. read this. NFS Active/Passive https://github.com/davidvossel/phd/blob/master/doc/presentations/nfs-ap-overview.pdf?raw=true NFS Active/Active https://github.com/davidvossel/phd/blob/master/doc/presentations/nfs-aa-overview.pdf?raw=true Note that the nfsnotify agent and the nfsserver agents have had a lot of work done to them in the last month or two upstream. Depending on what distro you are using, you may benefit from using the latest upstream agents (if rhel based definitely use the upstream agents.) -- Vossel Regards, Charlie Taylor pacemaker-cluster-libs-1.1.7-6.el6.x86_64 pacemaker-cli-1.1.7-6.el6.x86_64 pacemaker-1.1.7-6.el6.x86_64 pacemaker-libs-1.1.7-6.el6.x86_64 resource-agents-3.9.2-40.el6.x86_64 fence-agents-3.1.5-35.el6.x86_64 Red Hat Enterprise Linux Server release 6.3 (Santiago) Linux biostor3.ufhpc 2.6.32-279.19.1.el6.x86_64 #1 SMP Sat Nov 24 14:35:28 EST 2012 x86_64 x86_64 x86_64 GNU/Linux [root@biostor4 bs34]# crm status Last updated: Thu Jul 17 10:55:04 2014 Last change: Thu Jul 17 07:59:47 2014 via crmd on biostor3.ufhpc Stack: openais Current DC: biostor3.ufhpc - partition with quorum Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 2 Nodes configured, 2 expected votes 20 Resources configured. 
Online: [ biostor3.ufhpc biostor4.ufhpc ] Resource Group: grp_b3v0 vg_b3v0 (ocf::heartbeat:LVM): Started biostor3.ufhpc fs_b3v0 (ocf::heartbeat:Filesystem):Started biostor3.ufhpc ip_vbio3 (ocf::heartbeat:IPaddr2): Started biostor3.ufhpc ex_b3v0_1(ocf::heartbeat:exportfs): Started biostor3.ufhpc ex_b3v0_2(ocf::heartbeat:exportfs): Started biostor3.ufhpc ex_b3v0_3(ocf::heartbeat:exportfs): Started biostor3.ufhpc ex_b3v0_4(ocf::heartbeat:exportfs): Started biostor3.ufhpc ex_b3v0_5(ocf::heartbeat:exportfs): Started biostor3.ufhpc Resource Group: grp_b4v0 vg_b4v0 (ocf::heartbeat:LVM): Started biostor4.ufhpc fs_b4v0 (ocf::heartbeat:Filesystem):Started biostor4.ufhpc ip_vbio4 (ocf::heartbeat:IPaddr2): Started biostor4.ufhpc ex_b4v0_1(ocf::heartbeat:exportfs): Started biostor4.ufhpc ex_b4v0_2(ocf::heartbeat:exportfs): Started biostor4.ufhpc ex_b4v0_3(ocf::heartbeat:exportfs): Started biostor4.ufhpc ex_b4v0_4(ocf::heartbeat:exportfs): Started biostor4.ufhpc ex_b4v0_5(ocf::heartbeat:exportfs): Started biostor4.ufhpc st_bio3 (stonith:fence_ipmilan):Started biostor4.ufhpc st_bio4 (stonith:fence_ipmilan):Started biostor3.ufhpc ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Patch/recommendation for ocf:heartbeat:Filesystem cifs
- Original Message - From: Stefan Bauer (IZLBW Extern) stefan.ba...@iz.bwl.de To: linux-ha@lists.linux-ha.org Sent: Wednesday, June 18, 2014 1:42:14 AM Subject: [Linux-HA] Patch/recommendation for ocf:heartbeat:Filesystem cifs Dear Users/Developers, we're using ocf:heartbeat:Filesystem but fail to unmount cifs mounts if the cifs server went down. Please consider adding -l (lazy umount) to the umount_force variable in the RA. Does the umount -f option even make sense for cifs, or should we completely replace -f with -l when cifs is in use? the '-f' option only references NFS in the man page. You could propose this patch as a git pull request if you want. The resource-agent's source code is located here. https://github.com/ClusterLabs/resource-agents -- Vossel With the above option in use, we could unmounts the cifs share cleanly without running in any timeouts. Cheers Stefan ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Packemaker resources for Galera cluster
- Original Message - From: Razvan Oncioiu ronci...@gmail.com To: linux-ha@lists.linux-ha.org Sent: Wednesday, June 4, 2014 11:48:01 PM Subject: [Linux-HA] Packemaker resources for Galera cluster Hello, I can't seem to find a proper way of setting up resources in pacemaker to manager my Galera cluster. I want a VIP that will failover betwen 5 boxes ( this works ), but I would also like to tie this into a resources that monitors mysql as well. if a mysql instance goes down, the VIP should move to another box that has mysql actually running. But I do not want pacemaker to start or stop the mysql service. Here is my current configuration: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Pacemaker_Explained/index.html#s-resource-options Make a cloned mysql resource and set the 'is-managed=false' meta attribute on the resource. Pacemaker will monitor if mysql is up, but not attempt to start/stop it. Like emmanuel said, you'll need the order constraint start mysql then start VIP. You'll also need a colocation constraint that will force the VIP to locate to a node with an active mysql service. colocate VIP with mysql so... - make VIP resource - make cloned mysql resource with is-managed=false - order start start mysql-clone then VIP - colocate VIP with mysql-clone Good luck! -- Vossel node galera01 node galera02 node galera03 node galera04 node galera05 primitive ClusterIP IPaddr2 \ params ip=10.10.10.178 cidr_netmask=24 \ meta is-managed=true \ op monitor interval=5s primitive p_mysql mysql \ params pid=/var/lib/mysql/mysqld.pid test_user=root test_passwd=goingforbroke \ meta is-managed=false \ op monitor interval=5s OCF_CHECK_LEVEL=10 \ op start interval=0 timeout=60s \ op stop interval=0 timeout=60s on-fail=standby group g_mysql p_mysql ClusterIP order order_mysql_before_ip Mandatory: p_mysql ClusterIP property cib-bootstrap-options: \ dc-version=1.1.10-14.el6_5.3-368c726 \ cluster-infrastructure=classic openais (with plugin) \ stonith-enabled=false \ no-quorum-policy=ignore \ expected-quorum-votes=5 \ last-lrm-refresh=1401942846 rsc_defaults rsc-options: \ resource-stickiness=100 -- View this message in context: http://linux-ha.996297.n3.nabble.com/Packemaker-resources-for-Galera-cluster-tp15668.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Problem with migration, priority, stickiness
- Original Message - From: Tony Stocker tony.stoc...@nasa.gov To: Linux HA Cluster Development List linux-ha@lists.linux-ha.org Sent: Tuesday, May 20, 2014 8:18:52 AM Subject: [Linux-HA] Problem with migration, priority, stickiness Cluster s/w specs: Kernel: 2.6.32-431.17.1.el6.x86_64 OS: CentOS 6.5 corosync-1.4.1-17.el6_5.1.x86_64 pacemaker-1.1.10-14.el6_5.3.x86_64 crmsh-2.0+git46-1.1.x86_64 Attached to this email are two text files, one contains the output of 'crm configure show' (addresses sanitized) and the other contains the output of 'crm_simulate -sL' Here is the situation, and we've encountered this multiple times now and I've been unable to solve it: * A machine in the cluster fails * There is a spare node, unused, in the cluster available for assignment * The resource group that was on the failed machine, instead of being put onto the spare, unused node is placed on a node where another resource group is already running * The displaced resource group then is launched on the spare, unused node As an example, this morning the following occurred: Resource Group NRTMASTER is running on system gpmhac01 Resource Group NRTPNODE1 is running on system gpmhac02 Resource Group NRTPNODE2 is running on system gpmhac05 Resource Group NRTPNODE3 is running on system gpmhac04 Resource Group NRTPNODE4 is running on system gpmhac03 system gpmhac06 is up, available, and unused system gpmhac04 fails and powers off Resource Group NRTPNODE3 is moved to system gpmhac05 Resource Group NRTNPODE2 is moved to system gpmhac06 One of the big things that seems to occur here is that while the group NRTPNODE3 is being launched on gpmhac05, the group NRTPNODE2 is being shut down simultaneously which is causing race conditions where one start script is putting a state file in place, while the stop script is erasing it. This leaves the system in an unuseable state because required files, parameters, and settings are missing/corrupted. Secondly, there is simply no reason to kill a perfectly healthy resource group, that is operating just fine in order to launch a resource group whose machine has failed when: 1. There's a spare node available 2. The resource groups have equal priority with each other, i.e. all of the NRTPNODE# resource groups have priority 60 So I really need some help here in getting this setup so that it behaves the way we *think* it should be doing based on what we understand of the Pacemaker architecture. Obviously we're missing something since this resource group shuffling occurs when there's a failed system, despite having an unused, spare node available for immediate use, and has bitten us several times. The fact that the race condition between startup and shutdown is also causing the system that is brought up to be useless is exacerbating the situation immensely. Ideally, this is what we want: 1. If a system fails, the resources/resource group running on it are moved to an unused, available system. No other resource shuffling occurs amongst system occurs. 2. If a system fails and there is not an unused, available system to fail over to, then IF the resource group has a higher priority than another resource group, the group with the lower priority is shutdown. Only when that shutdown is complete will the resource group with the higher priority start its startup of resources. 3. 
If a system fails and there is not an unused, available system to fail over to, then IF the resource group has the same or lower priority to all other resource groups, then it will not attempt to launch itself on any other node, nor cause any other resource group to stop or migrate. 4. Unless specifically, and manually, ordered to move OR if the hardware system fails, a resource group should remain on its current hardware system. It should never be forced to migrate to a new system because something of equal or lower priority failed and migrated to a new system. 5. We do not need resource groups to fail back to original nodes, when running we want them to stay running on their current system until/unless a hardware failure occurs and forces them off the system, or we manually tell them to move. Can someone please look over our configuration, and the bizzare scores that I see from the crm_simulate output, and help get me to the point where I can achieve an HA cluster that doesn't kill healthy resources in some kind of game of musical chairs when there's an empty chair available. Can you also tell me why or help
[Linux-HA] Active/Active nfs server lock recovery?
Hey, Has anyone had any success with deploying an Active/Active NFS server? I'm curious how lock recovery is performed. In a typically Active/Passive scenario we have a nfs-server instance coupled with the exportfs. The nfs lock info is stored on some shared storage that follows that nfs server and the exportfs instances around the cluster. This allows us to alert the nfs clients after the failover that the server rebooted and that they need to re-establish their locks. With an Active/Active setup, we'd have multiple nfs servers and exportfs instances, non of which are tied to one another. Meaning that the exportfs resources could run on any of the nfs server instances within the cluster. On failover, if we wanted the exportfs resources on a failed node to be taken over by another already existing nfs server on another node. In this instance does anyone know of a good way to alert the nfs clients previously connected to the old (failed) node that they need to re-establish their locks with the new node? It seems like the statd info from both the failed node's nfs server and the new node's nfs server would have to be merged or something. any thoughts? -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] How to tell pacemaker to process a new event during a long-running resource operation
- Original Message - From: Maloja01 maloj...@arcor.de To: Linux-HA linux-ha@lists.linux-ha.org Sent: Friday, March 14, 2014 5:32:34 AM Subject: [Linux-HA] How to tell pacemaker to process a new event during a long-running resource operation Hi all, I have a resource which could in special cases have a very long-running start operation. in-flight operations always have to complete before we can process a new transition. The only way we can transition earlier is by killing the in-flight process, which results in failure recovery and possibly fencing depending on what operation it is. There's really nothing that can be done to speed this up except work on lowering the startup time of that resource. -- Vossel If I have a new event (like switching a standby node back to online) during the already running transition (cluster is still S_TRANSITION_ENGINE) I would like the cluster to process them as soon as possible and not only after the other resource came up. Is that possible? I tried already batch-limit but I guess this is only to make actions parallel in a combined transition, right? Thanks in advance ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Announcing docker resource-agent
Hey, I've created a docker resource agent that allows docker containers to be managed with pacemaker. The agent is up for review here, https://github.com/ClusterLabs/resource-agents/pull/370 Docker is a relatively new and fast moving project. I'd be surprised if anyone here is using it in production yet, but I'm sure some of you have investigated how it could be used. For review feedback, I'm not so much interested in a code review as much as a use-case analysis. How do you use or foresee yourself using docker containers in an HA environment, and does this agent work for your use-case? Thanks, -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Announcing a new HA KVM tutorial!
- Original Message - From: Digimer li...@alteeve.ca To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Monday, January 6, 2014 10:19:05 AM Subject: [Linux-HA] Announcing a new HA KVM tutorial! Almost exactly two years ago, I released the first tutorial for building an HA platform for KVM VMs. In that time, I have learned a lot, created some tools to simplify management and refined the design to handle corner-cases seen in the field. Today, the culmination of that learning is summed up in the 2nd Edition of that tutorial, now called AN!Cluster Tutorial 2. https://alteeve.ca/w/AN!Cluster_Tutorial_2 These HA KVM platforms have been in production for over two years now in facilities all over the world; Universities, municipal governments, corporate DCs, manufacturing facilities, etc. I've gotten wonderful feedback from users and all that real-world experience has been integrated into this new tutorial. As always, everything is 100% open source and free-as-in-beer! The major changes are: * SELinux and iptables are enabled and used. * Numerous slight changes made to the OS and cluster stack configuration to provide better corner-case fault handling. * Architecture refinements; ** Redundant PSUs, UPSes and fence methods emphasized. ** Monitoring multiple UPSes added via modified apcupsd ** Detailed monitoring of LSI-based RAID controllers and drives ** Discussion on hardware considerations for VM performance based on anticipated work loads * Naming convention changes to support the new AN!CDB dashboard[1] ** New alert system covered with fault and notable event alerting * Wider array of guest OSes are covered; ** Windows 7 ** Windows 8 ** Windows 2008 R2 ** Windows 2012 ** Solaris 11 ** FreeBSD 9 ** RHEL 6 ** SLES 11 Beyond that, the formatting of the tutorial itself has been slightly modified. I do think it is the easiest to follow tutorial I have yet been able to produce. I am very proud of this one! :D As always, feedback is always very much appreciated. Everything from typos/grammar mistakes, functional problems or anything else is very valuable. I take all the feedback I get and use it to helping make the tutorials better. Enjoy! wow, that's a seriously awesome tutorial. Excellent work :) -- Vossel Digimer, who now can now start the next tutorial in earnest! 1. https://alteeve.ca/w/AN!CDB -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] kamailio OCF resource agent for pacemaker
- Original Message - From: WENK Stefan stefan.w...@frequentis.com To: sr-...@lists.sip-router.org, linux-ha@lists.linux-ha.org Sent: Tuesday, January 7, 2014 3:12:58 AM Subject: [Linux-HA] kamailio OCF resource agent for pacemaker Hello, attached you find an initial version of a kamailio OCF compliant resource agent for pacemaker, which is currently running within a prototype laboratory on Redhat Enterprise Linux 6.x. Please keep in mind that it survived testing in a very controlled environment and it is young and some issues/bugs are likely to be found. I was allowed by FREQUENTIS to provide this script to the community under the GPL v2 license with the goal that putting this resources agent under the maintenance of the community, the safer and better it becomes long term. I'd hate to see this work disappear. If there is a community member who is interested testing this agent and working to get it pushed upstream, I'd happily help in the review process. To start this review process, we need someone to create a pull request for the upstream resource-agents git repo. https://github.com/ClusterLabs/resource-agents Thanks, -- Vossel Please note that I won't be able to respond on questions, because my mail account is going to be closed in a few days. Regards, Stefan Wenk Internal note: The released version corresponds with rel_0_19_0. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] FAQ(?): What's the process list?
- Original Message - From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de To: linux-ha@lists.linux-ha.org Sent: Friday, December 27, 2013 1:29:36 AM Subject: [Linux-HA] FAQ(?): What's the process list? Hi! I wonder what information the following log message contains: corosync[20745]: [pcmk ] info: update_member: Node (...) process list: 00151212 (1380882) Are ther individual bits that describe some features, or how ist this list built? This list consists of the pacemaker components (lrmd, attrd, cib, crmd, stonithd, pengine). Different bits represent different pacemaker components. Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] pacemaker restarts services (on the same node) when failed node returns
- Original Message - From: Peto Michalak peto.micha...@gmail.com To: dvos...@redhat.com, linux-ha linux-ha@lists.linux-ha.org Sent: Wednesday, December 11, 2013 2:26:02 AM Subject: Re: [Linux-HA] pacemaker restarts services (on the same node) when failed node returns Hi David, I've attached a crm_report which should show the restart of services, when failed node returns. I will go through the report as well to see if I find something there. The constraint in the xml below is what causes the restart. You are telling pacemaker place the PGServer on node drpg-02... When drpg-02 joins the cluster, PGserver restarts because it is being relocated to prpg-02. This should be expected. rsc_location id=cli-prefer-PGServer rsc=PGServer rule id=cli-prefer-rule-PGServer score=INFINITY boolean-op=and expression id=cli-prefer-expr-PGServer attribute=#uname operation=eq value=drpg-02 type=string/ /rule /rsc_location Thank you for your help. Best Regards, -Peter Hello, I really searched for the answer before posting : ). I have a pacemaker setup + corosync + drbd in Active/Passive mode running in 2 node cluster on Ubuntu 12.04.3. Everything works fine and on node failure the services are taken care of by the other node (THANKS guys!), well the problem is that I've noticed, that once the failed node comes back alive, the pacemaker restarts the postgresql and virtual IP and that takes around 4-7 seconds (but keeps it on the same node as I wanted, what's the point? :) ). Is this really necessary, or I've messed up something in the configuration? any chance you could provide us with a crm_report during the time frame of this unwanted restart? -- Vossel My pacemaker config: node drpg-01 attributes standby=off node drpg-02 attributes standby=off primitive drbd_pg ocf:linbit:drbd \ params drbd_resource=drpg \ op monitor interval=15 \ op start interval=0 timeout=240 \ op stop interval=0 timeout=120 primitive pg_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd/by-res/drpg directory=/db/pgdata options=noatime,nodiratime fstype=xfs \ op start interval=0 timeout=60 \ op stop interval=0 timeout=120 primitive pg_lsb lsb:postgresql \ op monitor interval=30 timeout=60 \ op start interval=0 timeout=60 \ op stop interval=0 timeout=60 primitive pg_vip ocf:heartbeat:IPaddr2 \ params ip=10.34.2.60 iflabel=pgvip \ op monitor interval=5 group PGServer pg_fs pg_lsb pg_vip ms ms_drbd_pg drbd_pg \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master order ord_pg inf: ms_drbd_pg:promote PGServer:start property $id=cib-bootstrap-options \ dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ no-quorum-policy=ignore \ pe-warn-series-max=1000 \ pe-input-series-max=1000 \ pe-error-series-max=1000 \ default-resource-stickiness=1000 \ cluster-recheck-interval=5min \ stonith-enabled=false \ last-lrm-refresh=1385646505 Thank you. Best Regards, -Peter ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Problem not in our membership
- Original Message - From: Moullé Alain alain.mou...@bull.net To: linux-ha@lists.linux-ha.org Sent: Tuesday, December 10, 2013 1:50:34 AM Subject: Re: [Linux-HA] Problem not in our membership Hi Sorry to ask again about this problem , does somebody has the answer ? Well, it certainly looks like it will fix the error you're seeing. There's only one way to know for sure. Give it a try :) I remember running into things similar to that a couple of years ago. I don't know if that patch was the only one involved. -- Vossel Thanks Alain Le 06/12/2013 08:57, Moullé Alain a écrit : Hi, I've found a thread talking about this problem on 1.1.7, but at the end , is the patch : https://github.com/ClusterLabs/pacemaker/commit/03f6105592281901cc10550b8ad19af4beb5f72f sufficient and correct to solve the problem ? Thanks Alain Le 03/12/2013 10:15, Moullé Alain a écrit : Hi, with : pacemaker-1.1.7-6 corosync-1.4.1-15 On crm migrate , I'm randomly facing this problem : ... node1 daemon warning cib warning: cib_peer_callback: Discarding cib_apply_diff message (342) from node2: not in our membership whereas the node2 is healthy and always member of the cluster. Is-it a known problem ? Is there already a patch ? Thanks Alain ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] pacemaker restarts services (on the same node) when failed node returns
- Original Message - From: Peto Michalak peto.micha...@gmail.com To: linux-ha@lists.linux-ha.org Sent: Monday, December 9, 2013 1:14:23 PM Subject: [Linux-HA] pacemaker restarts services (on the same node) when failed node returns Hello, I really searched for the answer before posting : ). I have a pacemaker setup + corosync + drbd in Active/Passive mode running in 2 node cluster on Ubuntu 12.04.3. Everything works fine and on node failure the services are taken care of by the other node (THANKS guys!), well the problem is that I've noticed, that once the failed node comes back alive, the pacemaker restarts the postgresql and virtual IP and that takes around 4-7 seconds (but keeps it on the same node as I wanted, what's the point? :) ). Is this really necessary, or I've messed up something in the configuration? any chance you could provide us with a crm_report during the time frame of this unwanted restart? -- Vossel My pacemaker config: node drpg-01 attributes standby=off node drpg-02 attributes standby=off primitive drbd_pg ocf:linbit:drbd \ params drbd_resource=drpg \ op monitor interval=15 \ op start interval=0 timeout=240 \ op stop interval=0 timeout=120 primitive pg_fs ocf:heartbeat:Filesystem \ params device=/dev/drbd/by-res/drpg directory=/db/pgdata options=noatime,nodiratime fstype=xfs \ op start interval=0 timeout=60 \ op stop interval=0 timeout=120 primitive pg_lsb lsb:postgresql \ op monitor interval=30 timeout=60 \ op start interval=0 timeout=60 \ op stop interval=0 timeout=60 primitive pg_vip ocf:heartbeat:IPaddr2 \ params ip=10.34.2.60 iflabel=pgvip \ op monitor interval=5 group PGServer pg_fs pg_lsb pg_vip ms ms_drbd_pg drbd_pg \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master order ord_pg inf: ms_drbd_pg:promote PGServer:start property $id=cib-bootstrap-options \ dc-version=1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ no-quorum-policy=ignore \ pe-warn-series-max=1000 \ pe-input-series-max=1000 \ pe-error-series-max=1000 \ default-resource-stickiness=1000 \ cluster-recheck-interval=5min \ stonith-enabled=false \ last-lrm-refresh=1385646505 Thank you. Best Regards, -Peter ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA
- Original Message - From: Lars Marowsky-Bree l...@suse.com To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Wednesday, December 4, 2013 3:49:17 AM Subject: Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA On 2013-12-04T10:25:58, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: You thought it was working, but in fact it wasn't. ;-) working meaning the resource started. not working meaning the resource does not start You see I have minimal requirements ;-) I'm sorry; we couldn't possibly test all misconfigurations. So this slipped through, we didn't expect someone to set that for a non-clustered VG previously. Updates have been made to the LVM agent to allow exclusive activation without clvmd. http://www.davidvossel.com/wiki/index.php?title=HA_LVM -- Vossel You could argue that it never should have worked. Anyway: If you want to activate a VG on exactly one node you should not need cLVM; only if you man to activate the VG on multiple nodes (as for a cluster file system)... You don't need cLVM to activate a VG on exactly one node. Correct. And you don't. The cluster stack will never activate a resource twice. Occasionally two safty lines are better than one. We HAD filesystem corruptions due to the cluster doing things it shouldn't do. And that's perfectly fine. All you need to do to activate this is vgchange -c y on the specific volume group, and the exclusive=true flag will work just fine. If you don't want that to happen, exclusive=true is not what you want to set. That makes sense, but what I don't like is that I have to mess with local lvm.conf files... You don't. Just drop exclusive=true, or set the clustered flag on the VG. You only have to change anything in the lvm.conf if you want to use tags for exclusivity protection (I defer to the LVM RA help for how to use that, I've never tried it). Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Antw: Re: establishing a new resource-agent package provider
- Original Message - From: Andrew Beekhof and...@beekhof.net To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Tuesday, July 30, 2013 7:46:25 AM Subject: Re: [Linux-HA] Antw: Re: establishing a new resource-agent package provider On 30/07/2013, at 4:21 PM, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote: David Vossel dvos...@redhat.com schrieb am 30.07.2013 um 01:20 in Nachricht 1719265415.18819216.1375140025306.javamail.r...@redhat.com: [...] How does this compare to the Red Hat fence/resource-agent packages? I'm very happy to see heartbeat and it's inherent confusion go away, so I am fundamentally for this. I only question core and how it will relate to those fence and resource agents. core would only be related to the ocf standard. I don't think this should have any relation to the fence agents. [...] I wonder: ocf:base:... or ocf:standard:... instaed of ocf:core:... My personal associations are a bit like this: core == essential base == basic functions many are not basic standard == somewhat standardized nor are they a standard (although they do conform to one)... they're just the ones that the people upstream ships. I like this one the least. yeah, standard sounds confusing to me. Splitting agents between core and base is going to be difficult as well. If we did something like that, I'd probably want to do 'core' and 'extended'... where core supported agents are ones the community takes ownership of, and extended agents are agents that exist in the project, but are only maintained by a subset of the community. I don't really want to do something like this though. common perhaps? perhaps, but I still prefer 'core' over 'common' These are the 'core' resource agents that the ocf community supports. Agents outside of the 'core' provider are supported by different projects and subsets of the community (like linbit and the drbd agent). To me 'common' refers to something that is shared... like a library or something. That probably isn't what we're going for. -- Vossel I don't much care beyond saying that continuing to call them heartbeat is a continuing source of confusion to people just arriving to our set of projects. Calling them heartbeat made sense originally, but now its an historical anachronism. IMHO. Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] establishing a new resource-agent package provider
hey, Historically the ocf resource agents have been shipped under the 'heartbeat' provider alias. Now that pacemaker exists, the legacy name heartbeat is slightly confusing since it refers to another project. We should change this. How would you all feel about moving all the 'heartbeat' provider agents into a new provider called 'core', and then for legacy purposes create a 'heartbeat' symlink that points to the 'core' directory so no one's configuration breaks... Eventually some day we could move in the direction of depreciating the use of the 'heartbeat' provider entirely. good plan? any thoughts? -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] establishing a new resource-agent package provider
- Original Message - From: Digimer li...@alteeve.ca To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Cc: David Vossel dvos...@redhat.com Sent: Monday, July 29, 2013 5:21:00 PM Subject: Re: [Linux-HA] establishing a new resource-agent package provider On 29/07/13 18:19, David Vossel wrote: hey, Historically the ocf resource agents have been shipped under the 'heartbeat' provider alias. Now that pacemaker exists, the legacy name heartbeat is slightly confusing since it refers to another project. We should change this. How would you all feel about moving all the 'heartbeat' provider agents into a new provider called 'core', and then for legacy purposes create a 'heartbeat' symlink that points to the 'core' directory so no one's configuration breaks... Eventually some day we could move in the direction of depreciating the use of the 'heartbeat' provider entirely. good plan? any thoughts? -- Vossel How does this compare to the Red Hat fence/resource-agent packages? I'm very happy to see heartbeat and it's inherent confusion go away, so I am fundamentally for this. I only question core and how it will relate to those fence and resource agents. core would only be related to the ocf standard. I don't think this should have any relation to the fence agents. -- Vossel -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] PCS and ping resources?
- Original Message - From: Jakob Curdes j...@info-systems.de To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Sunday, June 30, 2013 6:04:58 AM Subject: [Linux-HA] PCS and ping resources? Hello, I have configured a cluster on CentOS 6.x using PCS. All fine, but I miss the information how to create ping primitives and use them to ensure connectivity for the active machine. With crmsh this was done configuring a ocf:pacemaker:ping primitive and a clone; I could not figure out how to do this with PCS. $ pcs resource help . . . create resource id class:provider:type|type [resource options] [op operation action operation options [operation action operation options]...] [meta meta options...] [--clone|--master] Create specified resource. If --clone is used a clone resource is created (with options specified by --cloneopt clone_option=value), if --master is specified a master/slave resource is created. . . . I'm guessing the following would work. $ pcs resource pingrsc ping --clone -- Vossel I could not find any document describing this. Did I miss something? Regards, Jakob Curdes ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: linux-ha@lists.linux-ha.org Sent: Tuesday, June 18, 2013 4:30:30 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Mon, Jun 17, 2013 at 05:53:57PM -0400, David Vossel wrote: The plan to set 'volume_list=[@*]' in lvm.conf and override the tags during the activation using vgchange -ay --config 'tags{ mytag{} }' vg0 does not work. As a similar alternative, I am forcing the volume_list in lvm.conf to be initialized and not contain the tag the cluster is using for exclusive activation, and then overriding the volume_list during the activation to allow volume groups with the cluster tag to be activated. This is a very similar approach to what Lars original proposed. I have finished my initial work on the new set of patches. They can be found in this pull request. https://github.com/ClusterLabs/resource-agents/pull/252 This patch is going in tomorrow. Speak up now if you have any reservations concerning it. Manny thanks for all the work, as tomorrow has passed, I just merged it. Added a tag parameter description. I'm not sure why we would need to strip the tag on stop though? I see stripping the tag on stop as cleaning up the magic the LVM agent is doing behind the scenes. The tag is only needed to allow activation, activation is still prevented outside of the cluster management with or without the tag being present on the vg. Or why we would override a different tag on start. As the cluster tag is not supposed to change, we could just require the admin to set it once. Yep, we could do this. It puts another step in the admin's hands that we can automate though. Has the side-effect that an admin can revoke the cluster rights by simply re-tagging with something != the cluster tag. Do we still want the single-lv activation feature as well? I gave this a lot of thought. single-lv activation adds additional complexity to this whole exclusive activation with tags. Rather than complicate the situation any further, I am in favor of not pulling support for lv activation to the heartbeat agent. This agent is already complex to manage as it is. -- Vossel -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: David Vossel dvos...@redhat.com To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Tuesday, June 4, 2013 4:41:06 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation - Original Message - From: David Vossel dvos...@redhat.com To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Monday, June 3, 2013 10:50:01 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: linux-ha@lists.linux-ha.org Sent: Tuesday, May 21, 2013 5:58:05 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 21, 2013 at 05:52:39PM -0400, David Vossel wrote: - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Brassow Jonathan jbras...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Monday, May 20, 2013 3:50:49 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Fri, May 17, 2013 at 02:00:48PM -0500, Brassow Jonathan wrote: On May 17, 2013, at 10:14 AM, Lars Ellenberg wrote: On Thu, May 16, 2013 at 10:42:30AM -0400, David Vossel wrote: The use of 'auto_activation_volume_list' depends on updates to the LVM initscripts - ensuring that they use '-aay' in order to activate logical volumes. That has been checked in upstream. I'm sure it will go into RHEL7 and I think (but would need to check on) RHEL6. Only that this is upstream here, so it better work with debian oldstale, gentoo or archlinux as well ;-) Would this be good enough: vgchange --addtag pacemaker $VG and NOT mention the pacemaker tag anywhere in lvm.conf ... then, in the agent start action, vgchange -ay --config tags { pacemaker {} } $VG (or have the to be used tag as an additional parameter) No retagging necessary. How far back do the lvm tools understand the --config ... option? --config option goes back years and years - not sure of the exact date, but could probably tell with 'git bisect' if you wanted me to. The above would not quite be sufficient. You would still have to change the 'volume_list' field in lvm.conf (and update the initrd). You have to do that anyways if you want to make use of tags in this way? What you are proposing would simplify things in that you would not need different 'volume_list's on each machine - you could copy configs between machines. I thought volume_list = [ ... , @* ] in lvm.conf, assuming that works on all relevant distributions as well, and a command line --config tag would also propagate into that @*. It did so for me. But yes, vlumen_list = [ ... , pacemaker ] would be fine as well. wait, did we just go around in a circle. If we add pacemaker to the volume list, and use that in every cluster node's config, then we've by-passed the exclusive activation part have we not?! No. I suggested to NOT set that pacemaker tag in the config (lvm.conf), but only ever explicitly set that tag from the command line as used from the resource agent ( --config tags { pacemaker {} } ) That would also mean to either override volume_list with the same command line, or to have the tag mentioned in the volume_list in lvm.conf (but not set it in the tags {} section). Also, we're not happy with the auto_activate list because it won't work with old distros?! It's a new feature, why do we have to work with old distros that don't support it? 
You are right, we only have to make sure we don't break existing setups by rolling out a new version of the RA. So if the resource agent won't accidentally use a code path where support of a new feature (of LVM) would be required, that's good enough compatibility. Still it won't hurt to pick the most compatible implementation of several possible equivalent ones (RA-feature wise). I think the proposed --config tags { pacemaker {} } is simpler (no retagging, no re-writing of lvm meta data), and will work for any setup that knows about tags. I've had a good talk with Jonathan about the --config tags { pacemaker {} } approach. This was originally complicated for us because we were using the --config option for a device filter during activation in certain situations... using the --config option twice caused problems which made adding the tag in the config difficult. We've worked through those situations, and it looks like it is actually safe to strip out the conflicting --config usage required for the resilient device filtering on activation.
Re: [Linux-HA] Antw: ocf HA_RSCTMP directory location
- Original Message - From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Friday, June 14, 2013 1:34:58 AM Subject: [Linux-HA] Antw: ocf HA_RSCTMP directory location Hi! I think the location of the temporary directory is not that important, It is because pacemaker has to make sure the directory exists every time it starts up. Otherwise agents will fail. because you don't exchange data between RAs. For the design, I think it's sufficient if an RA that actually uses the temporary directory does check its existence (for validate-all, maybe). But even that is not necessary, because if the RA cannot write its PID where it wanted to, the start operation should fail, and the user will notice the problem. Amazingly I had seen a situation with samba recently, where smbd start exited OK, but did not start, because the PID file (an obsolete one) already existed. I had to remove the PID file manually. Clearly a bug in samba. If the temp directories don't get cleaned out on restart, this is a possibility. I think RAs should not rely on the fact that temp directories are clean when a resource is going to be started. The resource tmp directory has to get cleaned out on startup; if it doesn't, I don't think there is a good solution for resource agents to distinguish a stale pid file from one that is current. Nearly all the agents depend on this tmp directory to get reinitialized. If we decided not to depend on this logic, every agent would have to be altered to account for this. This would mean adding a layer of complexity to the agents that should otherwise be unnecessary. -- Vossel Regards, Ulrich David Vossel dvos...@redhat.com schrieb am 13.06.2013 um 23:59 in Nachricht 194863966.11352160.1371160774187.javamail.r...@redhat.com: Hey, Andrew and I have been running into some inconsistencies between resource-agent packages that we need to get cleared up. There's an ocf variable, HA_RSCTMP, used in many of the resource agents that represents a place the agents can store their PID files and other temporary data. This data needs to live under some directory in /var/run as that directory is typically cleared on startup. This is important to prevent stale PID files and other transient data from being persistent across restarts. Anyway. Here's the problem. Pacemaker thinks that data should live in '/var/run/heartbeat/rsctmp', but not all the resource-agent packages are consistent with that. For example, Suse's resource-agent package sets HA_RSCTMP to '/var/run/resource-agents' ( looking at this rpm, http://download.opensuse.org/distribution/11.4/repo/oss/suse/x86_64/resource-agents-1.0.3-9.12.1.x86_64.rpm ) We need to come to some sort of agreement because ultimately Pacemaker needs to make sure this directory exists on startup, whatever it is. If pacemaker doesn't create the right directory, it's possible the resource agents won't be able to access it since /var/run is re-initialized on startup. so, HA_RSCTMP = /var/run/heartbeat/rsctmp or HA_RSCTMP = /var/run/resource-agents thoughts?
-- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] ocf HA_RSCTMP directory location
Hey, Andrew and I have been running into some inconsistencies between resource-agent packages that we need to get cleared up. There's an ocf variable, HA_RSCTMP, used in many of the resource agents that represents a place the agent's can store their PID files and other temporary data. This data needs to live under some directory in /var/run as that directory is typically cleared on startup. This is important to prevent stale PID files and other transient data from being persistent across restarts. Anyway. Here's the problem. Pacemaker thinks that data should live in '/var/run/heartbeat/rsctmp', but not all the resource-agent packages are consistent with that. For example, Suse's resource-agent package sets HA_RSCTMP to '/var/run/resource-agents' ( looking at this rpm, http://download.opensuse.org/distribution/11.4/repo/oss/suse/x86_64/resource-agents-1.0.3-9.12.1.x86_64.rpm ) We need to come to some sort of agreement because ultimately Pacemaker needs to make sure this directory exists on startup, whatever it is. If pacemaker doesn't create the right directory, it's possible the resource agents won't be able to access it since /var/run is re-initialized on startup. so, HA_RSCTMP = /var/run/heartbeat/rsctmp or HA_RSCTMP = /var/run/resource-agents thoughts? -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
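To illustrate what agents expect from this variable, a minimal shell sketch (the fallback default and the daemon name are invented for the example):

# inside a resource agent
: ${HA_RSCTMP:=/var/run/resource-agents}   # fallback if the shell includes did not set it
PIDFILE="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.pid"
mkdir -p "$HA_RSCTMP"                      # harmless if pacemaker already created it
my_daemon --pidfile "$PIDFILE"             # hypothetical daemon being managed

Since /var/run is re-initialized at boot, any PID file found under HA_RSCTMP is known not to predate the last reboot, which is exactly the property the agents rely on.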
Re: [Linux-HA] Antw: Last call: removal of ocf:heartbeat:drbd in favor of ocf:linbit:heartbeat
- Original Message - From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Wednesday, June 12, 2013 1:00:45 AM Subject: [Linux-HA] Antw: Last call: removal of ocf:heartbeat:drbd in favor of ocf:linbit:heartbeat Besides merge or remove, you could move it to ocf:unsupported:drbd ;-) I'd like it to disappear entirely. The drbd agent is supported, just not by the heartbeat provider. If there wasn't a duplicate agent already, this would make sense. -- Vossel David Vossel dvos...@redhat.com schrieb am 11.06.2013 um 17:15 in Nachricht 624463553.10357145.1370963709261.javamail.r...@redhat.com: Hey, We need to get rid of this heartbeat drbd agent. It is outdated and linbit isn't supporting it. Instead linbit is shipping their own supported drbd agent in their own package using the linbit ocf provider. Any distro that is still using the heartbeat:drbd agent exclusively is old enough that a rebase of the resource-agents package is unlikely. Unless someone steps forward with a good argument against the removal of the heartbeat:drbd agent, the following pull request is going to be merged a week from today. (Tuesday 18th) https://github.com/ClusterLabs/resource-agents/pull/244 thanks, -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Master/Slave status check using crm_mon
- Original Message - From: John M john332...@gmail.com To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Wednesday, June 12, 2013 11:49:21 AM Subject: Re: [Linux-HA] Master/Slave status check using crm_mon Dear All, I will try to set up a pacemaker cluster in the coming weeks. Before that I have to complete the configuration using heartbeat 2.1.4. I would really appreciate it if you could suggest the configuration for the Master/Slave scenario mentioned in my previous mail. http://clusterlabs.org/doc/ Look through Clusters from Scratch. Read about multi-state resources in Pacemaker Explained. -- Vossel Thanks in advance. BR, Mark On Tuesday, June 11, 2013, Lars Marowsky-Bree l...@suse.com wrote: On 2013-06-11T15:05:11, John M john332...@gmail.com wrote: Unfortunately I cannot install pacemaker :( I just installed heartbeat 2.1.4 and in crm_mon I am getting Master/Slave status. You seriously need to upgrade. Heartbeat 2.1.4 is ages old and has many, many known bugs. You'll not be able to secure community aid for that version any more. Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
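For reference, a Master/Slave configuration on a current Pacemaker looks roughly like this in crm shell syntax (a sketch using the ocf:pacemaker:Stateful demo agent; names and intervals are examples):

primitive p_stateful ocf:pacemaker:Stateful \
    op monitor interval=10s role=Master \
    op monitor interval=11s role=Slave
ms ms_stateful p_stateful \
    meta master-max=1 master-node-max=1 clone-max=2 notify=true

Note the two monitor operations with different intervals; Pacemaker needs distinct intervals to monitor the Master and Slave roles separately.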
[Linux-HA] Last call: removal of ocf:heartbeat:drbd in favor of ocf:linbit:heartbeat
Hey We need to get rid of this heartbeat drbd agent. It is outdated and linbit isn't supporting it. Instead linbit is shipping their own supported drbd agent in their own package using the linbit ocf provider. Any distro that is still using the heartbeat:drbd agent exclusively is old enough that a rebase of the resource-agents package is unlikely. Unless someone steps forward with a good argument against the removal of the heartbeat:drbd agent, the following pull request is going to be merged a week from today. (Tuesday 18th) https://github.com/ClusterLabs/resource-agents/pull/244 thanks, -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: linux-ha@lists.linux-ha.org Sent: Tuesday, May 21, 2013 5:58:05 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 21, 2013 at 05:52:39PM -0400, David Vossel wrote: - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Brassow Jonathan jbras...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Monday, May 20, 2013 3:50:49 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Fri, May 17, 2013 at 02:00:48PM -0500, Brassow Jonathan wrote: On May 17, 2013, at 10:14 AM, Lars Ellenberg wrote: On Thu, May 16, 2013 at 10:42:30AM -0400, David Vossel wrote: The use of 'auto_activation_volume_list' depends on updates to the LVM initscripts - ensuring that they use '-aay' in order to activate logical volumes. That has been checked in upstream. I'm sure it will go into RHEL7 and I think (but would need to check on) RHEL6. Only that this is upstream here, so it better work with debian oldstale, gentoo or archlinux as well ;-) Would this be good enough: vgchange --addtag pacemaker $VG and NOT mention the pacemaker tag anywhere in lvm.conf ... then, in the agent start action, vgchange -ay --config tags { pacemaker {} } $VG (or have the to be used tag as an additional parameter) No retagging necessary. How far back do the lvm tools understand the --config ... option? --config option goes back years and years - not sure of the exact date, but could probably tell with 'git bisect' if you wanted me to. The above would not quite be sufficient. You would still have to change the 'volume_list' field in lvm.conf (and update the initrd). You have to do that anyways if you want to make use of tags in this way? What you are proposing would simplify things in that you would not need different 'volume_list's on each machine - you could copy configs between machines. I thought volume_list = [ ... , @* ] in lvm.conf, assuming that works on all relevant distributions as well, and a command line --config tag would also propagate into that @*. It did so for me. But yes, vlumen_list = [ ... , pacemaker ] would be fine as well. wait, did we just go around in a circle. If we add pacemaker to the volume list, and use that in every cluster node's config, then we've by-passed the exclusive activation part have we not?! No. I suggested to NOT set that pacemaker tag in the config (lvm.conf), but only ever explicitly set that tag from the command line as used from the resource agent ( --config tags { pacemaker {} } ) That would also mean to either override volume_list with the same command line, or to have the tag mentioned in the volume_list in lvm.conf (but not set it in the tags {} section). Also, we're not happy with the auto_activate list because it won't work with old distros?! It's a new feature, why do we have to work with old distros that don't support it? You are right, we only have to make sure we don't break existing setup by rolling out a new version of the RA. So if the resource agent won't accidentally use a code path where support of a new feature (of LVM) would be required, that's good enough compatibility. Still it won't hurt to pick the most compatible implementation of several possible equivalent ones (RA-feature wise). 
I think the proposed --config tags { pacemaker {} } is simpler (no retagging, no re-writing of lvm meta data), and will work for any setup that knows about tags. I've had a good talk with Jonathan about the --config tags { pacemaker {} } approach. This was originally complicated for us because we were using the --config option for a device filter during activation in certain situations... using the --config option twice caused problems which made adding the tag in the config difficult. We've worked through those situations and it looks like it is actually safe to strip out the conflicting --config usage required for the resilient device filtering on activation. The path forward is this. 1. Now that I know certain things are safe to remove, I'm going to re-evaluate my current patches and attempt to greatly simplify the number of changes to the original LVM agent. All the looking at the cluster membership and resilient activation checking can be thrown out. 2. Next I'm going to introduce the proposed --config tags feature as a separate patch that enables exclusive activation functionality without clvmd. The final result here is going to be a much less scary looking set of changes than what I currently have up
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Marowsky-Bree l...@suse.com To: linux-ha@lists.linux-ha.org Cc: Jonathan Brassow jbras...@redhat.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Friday, May 24, 2013 3:35:29 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On 2013-05-15T13:50:45, Lars Ellenberg lars.ellenb...@linbit.com wrote: Are we, in this discussion, perhaps losing the focus on the base submission of the code merge? Can we separate that (IMHO rather worthwhile) patch set from the exclusive activation part? I don't think we are losing focus, I think we're really close to finalizing the final bits here. If no one objects to the idea you proposed about using a single tag for the entire cluster to ensure exclusive activation, the patches remain similar to the way it is now, I just strip out a bunch of unnecessary stuff. I'm waiting to hear more feedback from Jonathan about this new direction before I act on anything. -- Vossel (Which I happen to have no strong opinion on, unless it is already shipped - in which case we need to support it.) Regards, Lars -- Architect Storage/HA SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg) Experience is the name everyone gives to their mistakes. -- Oscar Wilde ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] vm live migration without shared storage
Hey, I've been testing libvirt live migration without shared storage in Fedora 19 alpha. Specifically this feature, https://fedoraproject.org/wiki/Features/Virt_Storage_Migration. They've done a lot of work to make it a much more solid option than it has been in the past. I had to work through a few bugs with the guys over there, but it's working great for me now. So, do we want to use this in HA? It would be trivial to add this to the VirtualDomain resource agent, but does it make sense to do this? Migration time, depending on network speed and hardware, is much longer than the shared storage option (minutes vs. seconds). I don't mind adding this support to the agent, but I wanted to get people's feedback and make sure this is something we want before making that effort. -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
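Outside the cluster, the underlying libvirt operation is along these lines (host and domain names are examples; flag availability depends on the libvirt version):

# live-migrate vm1 and copy its complete disk image to the destination
$ virsh migrate --live --copy-storage-all vm1 qemu+ssh://dest-host/system
# --copy-storage-inc instead copies only the blocks missing from a pre-existing image on the destination

The agent change being discussed would amount to wiring a flag like this into the migrate path.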
Re: [Linux-HA] vm live migration without shared storage
- Original Message - From: Greg Woods wo...@ucar.edu To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Thursday, May 23, 2013 2:45:16 PM Subject: Re: [Linux-HA] vm live migration without shared storage On Thu, 2013-05-23 at 15:00 -0400, David Vossel wrote: Migration time, depending on network speed and hardware, is much longer than the shared storage option (minutes vs. seconds). This is just one data point (of course), but for the vast majority of services that I run, if the live migration time is as long as it takes to shut down a VM and boot it on another server, then there isn't much of an advantage to doing the live migration. Especially if we're talking about an option that is a long way from being battle-tested, and critical services such as DNS and authentication. Most of these critical services do not use long-lived connections. I can see a few VMs that exist to provide ssh logins where a minutes-long live migration would be clearly preferable to a shut down and reboot, but in most cases, if it's as slow as rebooting, it isn't going to be any advantage to me. It will be interesting though to see how many applications people come up with where a minutes-long live migration is preferable to shutdown and reboot. The actual migration takes awhile, but the transition between running on the source and running on the destination should be very fast. The source stays running while the disk is being copied to the destination, once the copy is complete it's like flipping a switch... at least that's my understanding. -- Vossel --Greg ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Brassow Jonathan jbras...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Monday, May 20, 2013 3:50:49 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Fri, May 17, 2013 at 02:00:48PM -0500, Brassow Jonathan wrote: On May 17, 2013, at 10:14 AM, Lars Ellenberg wrote: On Thu, May 16, 2013 at 10:42:30AM -0400, David Vossel wrote: The use of 'auto_activation_volume_list' depends on updates to the LVM initscripts - ensuring that they use '-aay' in order to activate logical volumes. That has been checked in upstream. I'm sure it will go into RHEL7 and I think (but would need to check on) RHEL6. Only that this is upstream here, so it better work with debian oldstale, gentoo or archlinux as well ;-) Would this be good enough: vgchange --addtag pacemaker $VG and NOT mention the pacemaker tag anywhere in lvm.conf ... then, in the agent start action, vgchange -ay --config tags { pacemaker {} } $VG (or have the to be used tag as an additional parameter) No retagging necessary. How far back do the lvm tools understand the --config ... option? --config option goes back years and years - not sure of the exact date, but could probably tell with 'git bisect' if you wanted me to. The above would not quite be sufficient. You would still have to change the 'volume_list' field in lvm.conf (and update the initrd). You have to do that anyways if you want to make use of tags in this way? What you are proposing would simplify things in that you would not need different 'volume_list's on each machine - you could copy configs between machines. I thought volume_list = [ ... , @* ] in lvm.conf, assuming that works on all relevant distributions as well, and a command line --config tag would also propagate into that @*. It did so for me. But yes, vlumen_list = [ ... , pacemaker ] would be fine as well. wait, did we just go around in a circle. If we add pacemaker to the volume list, and use that in every cluster node's config, then we've by-passed the exclusive activation part have we not?! Also, we're not happy with the auto_activate list because it won't work with old distros?! It's a new feature, why do we have to work with old distros that don't support it? -- Vossel Lars ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: linux-ha@lists.linux-ha.org Sent: Tuesday, May 21, 2013 5:58:05 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 21, 2013 at 05:52:39PM -0400, David Vossel wrote: - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Brassow Jonathan jbras...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Monday, May 20, 2013 3:50:49 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Fri, May 17, 2013 at 02:00:48PM -0500, Brassow Jonathan wrote: On May 17, 2013, at 10:14 AM, Lars Ellenberg wrote: On Thu, May 16, 2013 at 10:42:30AM -0400, David Vossel wrote: The use of 'auto_activation_volume_list' depends on updates to the LVM initscripts - ensuring that they use '-aay' in order to activate logical volumes. That has been checked in upstream. I'm sure it will go into RHEL7 and I think (but would need to check on) RHEL6. Only that this is upstream here, so it better work with debian oldstale, gentoo or archlinux as well ;-) Would this be good enough: vgchange --addtag pacemaker $VG and NOT mention the pacemaker tag anywhere in lvm.conf ... then, in the agent start action, vgchange -ay --config tags { pacemaker {} } $VG (or have the to be used tag as an additional parameter) No retagging necessary. How far back do the lvm tools understand the --config ... option? --config option goes back years and years - not sure of the exact date, but could probably tell with 'git bisect' if you wanted me to. The above would not quite be sufficient. You would still have to change the 'volume_list' field in lvm.conf (and update the initrd). You have to do that anyways if you want to make use of tags in this way? What you are proposing would simplify things in that you would not need different 'volume_list's on each machine - you could copy configs between machines. I thought volume_list = [ ... , @* ] in lvm.conf, assuming that works on all relevant distributions as well, and a command line --config tag would also propagate into that @*. It did so for me. But yes, vlumen_list = [ ... , pacemaker ] would be fine as well. wait, did we just go around in a circle. If we add pacemaker to the volume list, and use that in every cluster node's config, then we've by-passed the exclusive activation part have we not?! No. I suggested to NOT set that pacemaker tag in the config (lvm.conf), but only ever explicitly set that tag from the command line as used from the resource agent ( --config tags { pacemaker {} } ) That would also mean to either override volume_list with the same command line, or to have the tag mentioned in the volume_list in lvm.conf (but not set it in the tags {} section). Also, we're not happy with the auto_activate list because it won't work with old distros?! It's a new feature, why do we have to work with old distros that don't support it? You are right, we only have to make sure we don't break existing setup by rolling out a new version of the RA. So if the resource agent won't accidentally use a code path where support of a new feature (of LVM) would be required, that's good enough compatibility. Still it won't hurt to pick the most compatible implementation of several possible equivalent ones (RA-feature wise). 
I think the proposed --config tags { pacemaker {} } is simpler (no retagging, no re-writing of lvm meta data), and will work for any setup that knows about tags. (and I still think the RA should not try to second guess pacemaker and double check the membership... which in the case of this cluster wide tag would not make sense anymore, anyways) Of course this cluster wide tag pacemaker could be made configurable itself, so not all clusters in the world using this feature would use the same tag. primitive ... params exclusive=1 (implicitly does the right thing, and internally guesses whatever that may be, several times scanning and parsing LVM meta data just for that) becomes primitive ... params tag=weather-forecast-scratch-01 (explicitly knows to use the _tag variants of start/monitor/stop, no parsing and guessing, not try and retry, but just do-it-or-fail) Does that make sense at all? yes, we are on the same page now. I am in favor of this approach. I am still confused about Jonathan's comment though when you proposed this solution... The above would not quite be sufficient. You would still have to change the 'volume_list' field in lvm.conf (and update the initrd). Why would
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Brassow Jonathan jbras...@redhat.com To: David Vossel dvos...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com, Jonathan Brassow jbras...@redhat.com Sent: Thursday, May 16, 2013 9:32:38 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On May 16, 2013, at 9:08 AM, David Vossel wrote: - Original Message - From: Brassow Jonathan jbras...@redhat.com To: David Vossel dvos...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com, Jonathan Brassow jbras...@redhat.com Sent: Thursday, May 16, 2013 8:37:08 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On May 15, 2013, at 7:04 PM, David Vossel wrote: - Original Message - From: Brassow Jonathan jbras...@redhat.com To: David Vossel dvos...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Tuesday, May 14, 2013 5:01:02 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On May 14, 2013, at 10:36 AM, David Vossel wrote: - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Lars Marowsky-Bree l...@suse.com Cc: Fabio M. Di Nitto fdini...@redhat.com, General Linux-HA mailing list linux-ha@lists.linux-ha.org, Jonathan Brassow jbras...@redhat.com Sent: Tuesday, May 14, 2013 9:50:43 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 14, 2013 at 04:06:09PM +0200, Lars Marowsky-Bree wrote: On 2013-05-14T09:54:55, David Vossel dvos...@redhat.com wrote: Here's what it comes down to. You aren't guaranteed exclusive activation just because pacemaker is in control. There are scenarios with SAN disks where the node starts up and can potentially attempt to activate a volume before pacemaker has initialized. Yeah, from what I've read in the code, the tagged activation would also prevent a manual (or on-boot) vg/lv activation (because it seems lvm itself will refuse). That seems like a good idea to me. Unless I'm wrong, that concept seems sound, barring bugs that need fixing. Sure. And I'm not at all oposed to using tags. I want to get rid of the layer violation, which is the one Bad Thing I'm complaining about. Also, note that on stop, this strips all tags, leaving it untagged. On the next cluster boot, if that was really the concern, all nodes would grab and activate the VG, as it is untagged... That's not how it works. You have to take ownership of the volume before you can activate it. Untagged does not mean a node can activate it without first explicitly setting the tag. Ok, so I'm coming into this late. Sorry about that. David has this right. Tagging in conjunction with the 'volume_list' setting in lvm.conf is what is used to restrict VG/LV activation. As he mentioned, you don't want a machine to boot up and start doing a resync on a mirror while user I/O is happening on the node where the service is active. In that scenario, even if the LV is not mounted, there will be corruption. The LV must not be allowed activation in the first place. I think the HA scripts written for rgmanager could be considerably reduced in size. We probably don't need the matrix of different methods (cLVM vs Tagging. VG vs LV). Many of these came about as customers asked for them and we didn't want to compromise backwards compatibility. 
If we are switching, now's the time for clean-up. In fact, LVM has something new in lvm.conf: 'auto_activation_volume_list'. If the list is defined and a VG/LV is in the list, it will be automatically activated on boot; otherwise, it will not. That means, forget tagging and forget cLVM. Make users change 'auto_activation_volume_list' to include only VGs that are not controlled by pacemaker. The HA script should then make sure that 'auto_activation_volume_list' is defined and does not contain the VG/LV that is being controlled by pacemaker. It would be necessary to check that the lvm.conf copy in the initrd is properly set. The use of 'auto_activation_volume_list' depends on updates to the LVM initscripts - ensuring that they use '-aay' in order to activate logical volumes. That has been checked in upstream. I'm sure it will go into RHEL7 and I think (but would need to check on) RHEL6. The 'auto_activation_volume_list' doesn't seem like it's exactly what we want here though. It kind of works for what we want to achieve, but only as a side effect, and I'm not sure it would work for everyone's deployment. For example, is there a way to set 'auto_activation_volume_list' as empty and still be able to ensure that no volume groups are activated at startup?
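Concretely, Jonathan's suggestion amounts to an lvm.conf along these lines (the VG name is an example; the initrd copy of lvm.conf would have to match):

activation {
    # only local, non-cluster VGs auto-activate at boot via 'vgchange -aay';
    # cluster-managed VGs are omitted, so boot scripts leave them alone
    auto_activation_volume_list = [ "vg_local" ]
}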
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Cc: Jonathan Brassow jbras...@redhat.com, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Wednesday, May 15, 2013 6:50:45 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 14, 2013 at 11:36:54AM -0400, David Vossel wrote: - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Lars Marowsky-Bree l...@suse.com Cc: Fabio M. Di Nitto fdini...@redhat.com, General Linux-HA mailing list linux-ha@lists.linux-ha.org, Jonathan Brassow jbras...@redhat.com Sent: Tuesday, May 14, 2013 9:50:43 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 14, 2013 at 04:06:09PM +0200, Lars Marowsky-Bree wrote: On 2013-05-14T09:54:55, David Vossel dvos...@redhat.com wrote: Here's what it comes down to. You aren't guaranteed exclusive activation just because pacemaker is in control. There are scenarios with SAN disks where the node starts up and can potentially attempt to activate a volume before pacemaker has initialized. Yeah, from what I've read in the code, the tagged activation would also prevent a manual (or on-boot) vg/lv activation (because it seems lvm itself will refuse). That seems like a good idea to me. Unless I'm wrong, that concept seems sound, barring bugs that need fixing. Sure. And I'm not at all opposed to using tags. I want to get rid of the layer violation, which is the one Bad Thing I'm complaining about. Also, note that on stop, this strips all tags, leaving it untagged. On the next cluster boot, if that was really the concern, all nodes would grab and activate the VG, as it is untagged... That's not how it works. You have to take ownership of the volume before you can activate it. Untagged does not mean a node can activate it without first explicitly setting the tag. The scenario is this (please correct me): There are a number of hosts that can see the PVs for this VG. Some of them may *not* be part of the pacemaker cluster. I don't expect this to be the case. If the node isn't a member of the cluster and it can see the PVs, then it is likely just starting up after being fenced and will rejoin the cluster shortly. But *ALL* of them have their lvm.conf contain the equivalent of global { locking_type = 1 } tags { hosttags = 1 } activation { volume_list = [ @* ] } If any node is able to see the PVs, but has volume_list undefined, vgchange -ay would activate it anyways. So we are back at "Don't do that." Ha, well if people want to shoot themselves in the foot, we can't help that either. The point of the feature is to give people a path to ensure exclusive activation without using clvmd+dlm. The preferred and safest way is still to use clvmd. I've tested this specific scenario and was unable to activate the volume group manually without grabbing the tag first. Have you tested this and found something contrary to my results? This is how the feature is supposed to work. See above ;-) no hosttags = 1, no volume_list, no checking against it. Granted, the lvm.conf would be prepared at deployment time, so let's assume it is set up OK on all hosts across the site anyways. Still, I don't see what we gain by:
check tag against my name
if not me, check tag against membership list
if not present, strip all tags and add my name
This is a redundancy check at the resource level to attempt to prevent data corruption.
The "if not me, and not in member list" case likely means the owner was a part of the member list at one point and has been fenced. We feel safe taking the tags in that case. If the owner is a member of the cluster still, then something is wrong that the admin should investigate. I'm guessing this could mean fencing failed and the admin unblocked the cluster in a weird state. If I was writing this feature for the first time I probably wouldn't have thought to add this check in, but I don't see any harm in leaving it, as it appears to serve a purpose. -- Vossel
try vgchange -ay
instead of doing just:
strip all tags and add my name
try vgchange -ay
In what scenario (apart from Pacemaker not being able to count to 1) would the more elaborate version protect us better than the simple one, and against what? -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
[Linux-HA] Announcing Pacemaker Remote - extending high availability outside the cluster stack
Hi, I'm excited to announce the initial development phase of the Pacemaker Remote daemon is complete and ready for testing in Pacemaker v1.1.10rc2. Below is the first draft of a deployment guide that outlines the initial supported Pacemaker Remote use-cases and provides walk-through examples. Note however that Fedora 18 does not have the pacemaker-remote subpackage rpm available even though I do reference it in the documentation (this will be changed in a future draft). You'll have to use the 1.1.10rc2 tag in github for now. Documentation: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Remote/index.html For those of you unfamiliar with Pacemaker Remote, the pacemaker_remote service is a new daemon introduced in Pacemaker v1.1.10 that allows nodes not running the cluster stack (pacemaker+corosync) to integrate into the cluster and have the cluster manage their resources just as if they were a real cluster node. This means that Pacemaker clusters are now capable of both launching virtual environments (KVM/LXC) and managing the resources that live within those virtual environments, without requiring the virtual environments to run pacemaker or corosync. Usage of the pacemaker_remote daemon is currently limited to virtual guests such as KVM and Linux Containers, but several future enhancements to include additional use-cases are in the works. These planned future enhancements include the following.
- Libvirt Sandbox Support Once the libvirt-sandbox project is integrated with pacemaker_remote, we will gain the ability to perform per-resource linux container isolation with very little performance impact. This functionality will allow resources living on a single node to be isolated from one another. At that point CPU and memory limits could be set per-resource in the cluster dynamically just using the cluster config.
- Bare-metal Support The pacemaker_remote daemon already has the ability to run on bare-metal hardware nodes, but the policy engine logic for integrating bare-metal nodes is not complete. There are some complications involved with understanding a bare-metal node's state that virtual nodes don't have. Once this logic is complete, pacemaker will be able to integrate bare-metal nodes in the same way virtual remote-nodes currently are. Some special considerations for fencing will need to be addressed.
- Virtual Remote-node Migration Support Pacemaker's policy engine is limited in its ability to perform live migrations of KVM resources when resource dependencies are involved. This limitation affects how resources living within a KVM remote-node are handled when a live migration takes place. Currently when a live migration is performed on a KVM remote-node, all the resources within that remote-node have to be stopped before the migration takes place and started once again after migration has finished. This policy engine limitation is fully explained in this bug report, http://bugs.clusterlabs.org/show_bug.cgi?id=5055#c3
-- David Vossel dvos...@redhat.com irc: dvossel on irc.freenode.net ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
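The integration itself is driven by a single resource meta-attribute; in pcs syntax the virtual guest case from the deployment guide looks roughly like this (names and paths are examples, the guide above is authoritative):

# manage the KVM guest itself, and tell the cluster the guest runs pacemaker_remote
$ pcs resource create vm-guest1 VirtualDomain hypervisor="qemu:///system" config="/etc/libvirt/qemu/guest1.xml" meta remote-node=guest1
# once the guest is up, resources can be placed on it like any cluster node
$ pcs constraint location some-resource prefers guest1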
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Brassow Jonathan jbras...@redhat.com To: David Vossel dvos...@redhat.com Cc: General Linux-HA mailing list linux-ha@lists.linux-ha.org, Lars Marowsky-Bree l...@suse.com, Fabio M. Di Nitto fdini...@redhat.com Sent: Tuesday, May 14, 2013 5:01:02 PM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On May 14, 2013, at 10:36 AM, David Vossel wrote: - Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Lars Marowsky-Bree l...@suse.com Cc: Fabio M. Di Nitto fdini...@redhat.com, General Linux-HA mailing list linux-ha@lists.linux-ha.org, Jonathan Brassow jbras...@redhat.com Sent: Tuesday, May 14, 2013 9:50:43 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 14, 2013 at 04:06:09PM +0200, Lars Marowsky-Bree wrote: On 2013-05-14T09:54:55, David Vossel dvos...@redhat.com wrote: Here's what it comes down to. You aren't guaranteed exclusive activation just because pacemaker is in control. There are scenarios with SAN disks where the node starts up and can potentially attempt to activate a volume before pacemaker has initialized. Yeah, from what I've read in the code, the tagged activation would also prevent a manual (or on-boot) vg/lv activation (because it seems lvm itself will refuse). That seems like a good idea to me. Unless I'm wrong, that concept seems sound, barring bugs that need fixing. Sure. And I'm not at all oposed to using tags. I want to get rid of the layer violation, which is the one Bad Thing I'm complaining about. Also, note that on stop, this strips all tags, leaving it untagged. On the next cluster boot, if that was really the concern, all nodes would grab and activate the VG, as it is untagged... That's not how it works. You have to take ownership of the volume before you can activate it. Untagged does not mean a node can activate it without first explicitly setting the tag. Ok, so I'm coming into this late. Sorry about that. David has this right. Tagging in conjunction with the 'volume_list' setting in lvm.conf is what is used to restrict VG/LV activation. As he mentioned, you don't want a machine to boot up and start doing a resync on a mirror while user I/O is happening on the node where the service is active. In that scenario, even if the LV is not mounted, there will be corruption. The LV must not be allowed activation in the first place. I think the HA scripts written for rgmanager could be considerably reduced in size. We probably don't need the matrix of different methods (cLVM vs Tagging. VG vs LV). Many of these came about as customers asked for them and we didn't want to compromise backwards compatibility. If we are switching, now's the time for clean-up. In fact, LVM has something new in lvm.conf: 'auto_activation_volume_list'. If the list is defined and a VG/LV is in the list, it will be automatically activated on boot; otherwise, it will not. That means, forget tagging and forget cLVM. Make users change 'auto_activation_volume_list' to include only VGs that are not controlled by pacemaker. The HA script should then make sure that 'auto_activation_volume_list' is defined and does not contain the VG/LV that is being controlled by pacemaker. It would be necessary to check that the lvm.conf copy in the initrd is properly set. The use of 'auto_activation_volume_list' depends on updates to the LVM initscripts - ensuring that they use '-aay' in order to activate logical volumes. That has been checked in upstream. 
I'm sure it will go into RHEL7 and I think (but would need to check on) RHEL6. The 'auto_activation_volume_list' doesn't seem like it's exactly what we want here though. It kind of works for what we want to achieve, but only as a side effect, and I'm not sure it would work for everyone's deployment. For example, is there a way to set 'auto_activation_volume_list' as empty and still be able to ensure that no volume groups are initiated at startup? What I'd really like to see is some sort of 'allow/deny' filter just for startup. Then we could do something like this.
# start by denying everything on startup
auto_activation_deny_list=[ @* ]
# If we need to allow some vg on startup, we can explicitly enable them here.
auto_activation_allow_list=[ vg1, vg2 ]
Is something like the above possible yet? Using a method like this, we lose the added security that the tags give us outside of the cluster management. I trust pacemaker though :) -- Vossel brassow ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: linux-ha@lists.linux-ha.org Cc: David Vossel dvos...@redhat.com, Fabio M. Di Nitto fdini...@redhat.com, Andrew Beekhof and...@beekhof.net, Lars Marowsky-Bree l...@suse.com, Lon Hohberger l...@redhat.com, Jonathan Brassow jbras...@redhat.com, Dejan Muhamedagic deja...@fastmail.fm Sent: Tuesday, May 14, 2013 6:22:08 AM Subject: LVM Resource agent, exclusive activation This is about pull request https://github.com/ClusterLabs/resource-agents/pull/222 Merge redhat lvm.sh feature set into heartbeat LVM agent Apologies to the CC for list duplicates. Cc list was made by looking at the comments in the pull request, and some previous off-list thread. Even though this is about resource agent feature development, and thus actually a topic for the -dev list, I wanted to give this the maybe wider audience of the users list, to encourage feedback from people who actually *use* this feature with rgmanager, or intend to use it once it is in the pacemaker RA. Here is my perception of this pull request, as such very subjective, and I may have gotten some intentions or facts wrong, so please correct me, or add whatever I may have missed. Apart from a larger restructuring of the code, this introduces the feature of exclusive activation of LVM volume groups. From the commit message: This patch leaves the original LVM heartbeat functionality intact while adding these additional features from the redhat agent. 1. Exclusive activation using volume group tags. This feature allows a volume group to live on shared storage within the cluster without requiring the use of cLVM for metadata locking. 2. Individual logical volume activation for local and cluster volume groups by using the new 'lvname' option. 3. Better setup validation when the 'exclusive' option is enabled. This patch validates that when exclusive activation is enabled, either a cluster volume group is in use with cLVM, or the tags variant is configured correctly. These new checks also make it impossible to enable exclusive activation for cloned resources. That sounds great. Why even discuss it? Of course we want that. But I feel it does not do what it advertises. Rather, I think it gives a false sense of exclusivity that is actually not met. (Point 2, individual LV activation, is OK with me, I think; my difficulties are with the exclusive-by-tagging thingy.) So what does it do? To activate a VG exclusively, it uses LVM tags (see the LVM documentation about these). Any VG or LV can be tagged with a number of tags. Here, only one tag is used (and any other tags will be stripped!). I try to contrast current behaviour and exclusive behaviour.
On start:
non-exclusive: just (try to) activate the VG.
exclusive by tag: check if the VG is currently tagged with my node name. If not, is it tagged at all? If it is tagged, and that happens to be a node name that is in the current corosync membership: FAIL activation. Else (it is tagged, but not with a node name, or not with one currently in the membership): strip any and all tags, then proceed. If not FAILed because already tagged by another member, re-tag with *my* node name and activate it.
Also it does double check the ownership in monitor:
non-exclusive: I think due to the high timeout potential under load when using any LVM commands, this just checks for the presence of the /dev/$VGNAME directory nowadays, which is lightweight, and usually good enough (as the services *using* the LVs are monitored anyways).
exclusive by tag: it does the above, then, if active, double checks that the current node name is also the current tag value, and if not, (tries to) deactivate (which will usually fail, as it can only succeed if the VG is unused), and returns failure to Pacemaker, which will then do its recovery cycle. By default, Pacemaker would stop all depending resources, stop this one, and restart the whole stack. Which will, in a real split-brain situation, just make sure that nodes keep stealing it from each other; it does not prevent corruption in any way. In a non-split-brain case, this situation cannot happen anyways. Unless two nodes raced to activate it while it was untagged. Oops, so it does not prevent that either. For completeness, on stop:
non-exclusive: just deactivate the VG.
exclusive by tag: double check that I am the tag owner, then strip that tag (so no tag remains, the VG becomes untagged) and deactivate. So the resource agent tries
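The simpler start sequence argued for above boils down to something like this (a sketch; error handling is omitted and the names are invented):

VG="vg0"              # volume group managed by this resource
TAG="$(uname -n)"     # claim it with my own node name
# strip whatever tags are present, then re-tag and activate
for t in $(vgs --noheadings -o tags "$VG" | tr ',' ' '); do
    vgchange --deltag "$t" "$VG"
done
vgchange --addtag "$TAG" "$VG"
vgchange -ay "$VG"    # lvm.conf's volume_list gates this on the tag

The point of contention is whether anything beyond this (membership lookups, refusing to steal from live members) buys real protection or merely the appearance of it.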
Re: [Linux-HA] LVM Resource agent, exclusive activation
- Original Message - From: Lars Ellenberg lars.ellenb...@linbit.com To: Lars Marowsky-Bree l...@suse.com Cc: Fabio M. Di Nitto fdini...@redhat.com, General Linux-HA mailing list linux-ha@lists.linux-ha.org, Jonathan Brassow jbras...@redhat.com Sent: Tuesday, May 14, 2013 9:50:43 AM Subject: Re: [Linux-HA] LVM Resource agent, exclusive activation On Tue, May 14, 2013 at 04:06:09PM +0200, Lars Marowsky-Bree wrote: On 2013-05-14T09:54:55, David Vossel dvos...@redhat.com wrote: Here's what it comes down to. You aren't guaranteed exclusive activation just because pacemaker is in control. There are scenarios with SAN disks where the node starts up and can potentially attempt to activate a volume before pacemaker has initialized. Yeah, from what I've read in the code, the tagged activation would also prevent a manual (or on-boot) vg/lv activation (because it seems lvm itself will refuse). That seems like a good idea to me. Unless I'm wrong, that concept seems sound, barring bugs that need fixing. Sure. And I'm not at all oposed to using tags. I want to get rid of the layer violation, which is the one Bad Thing I'm complaining about. Also, note that on stop, this strips all tags, leaving it untagged. On the next cluster boot, if that was really the concern, all nodes would grab and activate the VG, as it is untagged... That's not how it works. You have to take ownership of the volume before you can activate it. Untagged does not mean a node can activate it without first explicitly setting the tag. So no, in the current form, it just *pretends* to protect against a number of things, but actually does not. And that is the other, even worse, Bad Thing. That's similar to what cLVM2 does and protects against, but without needing the cLVM2/DLM bits; that has, uhm, advantages too. In short, I'm in favor of this feature. (Clearly, lge has pointed out one or two issues that need fixing, that doesn't detract from the idea.) But that would be implemented simply by using tags, and on start: re-tag with my nodename activate That way, it is always tagged, so no stupid initrd, udev or boot script, not even a tired admin, will accidentally activate it. No need for anything else, no callout to membership necessary. All that smoke and mirrors adds complexity, and does not buy us anything, but a false sense of what that could possibly protect us against. If it was tagged with an other node name that is in the membership, then pacemaker would know about it, too, and had made sure it is not activated there. If that other node was not in the membership, we would re-tag and activate anyways. So why not just do that, document that it is done this way, and not pretend it would do more than that. It does not. I've tested this specific scenario and was unable to activate the volume group manually without grabbing the tag first. Have you tested this and found something contrary to my results? This is how the feature is supposed to work. -- Vossel Lars -- : Lars Ellenberg : LINBIT | Your Way to High Availability : DRBD/HA support and consulting http://www.linbit.com ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
Re: [Linux-HA] Q: limiting parallel execution of resource actions
- Original Message - From: Ulrich Windl ulrich.wi...@rz.uni-regensburg.de To: linux-ha@lists.linux-ha.org Sent: Monday, April 15, 2013 1:39:34 AM Subject: [Linux-HA] Q: limiting parallel execution of resource actions Hi! I have a question: If I want to limit parallel execution of some resources, how can I configure that? Background: Some resources may hit some OS bug when demanding high I/O (e.g. starting several Xen-VMs residing on an OCFS2 filesystem that is mirrored by cLVM). The I/O performance will drop approximately to zero when cLVM is also mirroring. That's not because of the I/O channel being saturated, but because of terrible programming regarding cLVM... anyway: Would it work to define an advisory ordering of resources (in addition to a different mandatory ordering), so if crm would schedule all those resources at once (in parallel), it would actually schedule them sequentially? If I stop one resource, will those logically after the one being stopped also be stopped due to advisory ordering then? Maybe a new mechanism is needed to restrict parallelism based on resources (i.e. a new type of constraint). hey, the 'batch-limit' cluster option might help. http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_available_cluster_options -- Vossel Regards, Ulrich ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
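batch-limit caps the number of actions the transition engine runs in parallel cluster-wide. Setting it is a one-liner; both forms below are sketches, and the value 2 is only an example:

# with pcs
$ pcs property set batch-limit=2
# or with the crm shell
$ crm configure property batch-limit=2

Note that batch-limit throttles all actions globally, not just the Xen VM starts, so the smallest value that avoids the I/O stall is the right choice.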
Re: [Linux-HA] IPaddr2 support of ipv6
- Original Message - From: Keisuke MORI keisuke.mori...@gmail.com To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Sent: Tuesday, April 2, 2013 1:41:21 AM Subject: Re: [Linux-HA] IPaddr2 support of ipv6 Hi, 2013/3/29 David Vossel dvos...@redhat.com: Hi, It looks like ipv6 support got added to the IPaddr2 agent last year. I'm curious why the metadata only advertises that the 'ip' option should be an IPv4 address. ip (required): The IPv4 address to be configured in dotted quad notation, for example 192.168.1.1. Is this just an oversight? If so this patch would probably help. https://github.com/davidvossel/resource-agents/commit/07be0019a50b96743536ab50727b56d9175bf95f Ah, yes that's just an oversight. Thank you for pointing that out. Would you submit your patch as a pull request? :) done, https://github.com/ClusterLabs/resource-agents/pull/219 Thanks, -- Keisuke MORI ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
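Independent of the metadata wording, the agent's IPv6 path can be exercised directly with ocf-tester from the resource-agents package. A sketch, using a documentation-range IPv6 address and an assumed install path:

    # Run the standard OCF conformance tests against IPaddr2 with an
    # IPv6 address (2001:db8::/32 is reserved for documentation).
    ocf-tester -n test-ipv6 \
        -o ip=2001:db8::10 \
        -o cidr_netmask=64 \
        /usr/lib/ocf/resource.d/heartbeat/IPaddr2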
Re: [Linux-HA] Getting Unknown Error
- Original Message - From: Ahmed Munir ahmedmunir...@gmail.com To: linux-ha@lists.linux-ha.org Sent: Friday, March 29, 2013 11:26:50 AM Subject: [Linux-HA] Getting Unknown Error Hi, I recently configured Linux HA for the Asterisk service (using the Asterisk resource agent downloaded from: https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/asterisk ). As configured it is working well, but when I include the monitor_sipuri=sip:42@10.3.152.103 parameter in the primitive section it gives me errors like those listed below; root@asterisk2 ~ crm_mon -1 Last updated: Thu Mar 28 06:09:54 2013 Stack: Heartbeat Current DC: asterisk2 (b966dfa2-5973-4dfc-96ba-b2d38319c174) - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, unknown expected votes 1 Resources configured. Online: [ asterisk1 asterisk2 ] Resource Group: group_1 asterisk_2 (lsb:asterisk): Started asterisk1 Do you have two asterisk instances in the cluster, an LSB and an OCF one?! I'm confused by this. IPaddr_10_3_152_103 (ocf::heartbeat:IPaddr): Started asterisk1 Failed actions: p_asterisk_start_0 (node=asterisk1, call=64, rc=1, status=complete): unknown error p_asterisk_start_0 (node=asterisk2, call=20, rc=1, status=complete): unknown error I tested the 'sipsak' tool on the CLI and it executes without any issue, i.e. it returns 200 OK, but when I remove the monitor_sipuri param I don't get the errors. Did you use the exact same SIP URI, and test it on the box asterisk is running on? Look through the log output; perhaps the resource agent is outputting some information that could give you a clue as to what is going on. A trick I'd use is running wireshark on the box the asterisk resource is starting on: watch the OPTIONS request come in and see how it is responded to, to rule out any problems on that side. -- Vossel Listing the configuration below; node $id=887bae58-1eb6-47d1-b539-d12a2ed3d836 asterisk1 node $id=b966dfa2-5973-4dfc-96ba-b2d38319c174 asterisk2 primitive IPaddr_10_3_152_103 ocf:heartbeat:IPaddr \ op monitor interval=5s timeout=20s \ params ip=10.3.152.103 primitive p_asterisk ocf:heartbeat:asterisk \ op monitor interval=10s \ params realtime=true group group_1 p_asterisk IPaddr_10_3_152_103 \ meta target-role=Started location rsc_location_group_1 group_1 \ rule $id=preferred_location_group_1 100: #uname eq asterisk1 colocation asterisk-with-ip inf: p_asterisk IPaddr_10_3_152_103 property $id=cib-bootstrap-options \ symmetric-cluster=true \ no-quorum-policy=stop \ default-resource-stickiness=0 \ stonith-enabled=false \ stonith-action=reboot \ startup-fencing=true \ stop-orphan-resources=true \ stop-orphan-actions=true \ remove-after-stop=false \ default-action-timeout=120s \ is-managed-default=true \ cluster-delay=60s \ pe-error-series-max=-1 \ pe-warn-series-max=-1 \ pe-input-series-max=-1 \ dc-version=1.0.12-unknown \ cluster-infrastructure=Heartbeat And the status I'm getting is listed below; root@asterisk1 ~ crm_mon -1 Last updated: Fri Mar 29 12:25:10 2013 Stack: Heartbeat Current DC: asterisk1 (887bae58-1eb6-47d1-b539-d12a2ed3d836) - partition with quorum Version: 1.0.12-unknown 2 Nodes configured, unknown expected votes 1 Resources configured. Online: [ asterisk1 asterisk2 ] Resource Group: group_1 p_asterisk (ocf::heartbeat:asterisk): Started asterisk1 IPaddr_10_3_152_103 (ocf::heartbeat:IPaddr): Started asterisk1 Please advise how to overcome this issue.
-- Regards, Ahmed Munir Chohan ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
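The check Vossel suggests can be run by hand on the node hosting the resource. A sketch using the URI from the configuration above; the agent's exact invocation may differ, so treat this as an approximation and check the agent source:

    # sipsak sends a SIP OPTIONS request to the URI and exits 0 on 200 OK.
    sipsak -s sip:42@10.3.152.103
    echo "exit code: $?"   # non-zero is roughly what the monitor would see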
Re: [Linux-HA] Linux HA for Asterisk
- Original Message - From: Ahmed Munir ahmedmunir...@gmail.com To: linux-ha@lists.linux-ha.org Sent: Tuesday, March 19, 2013 11:03:38 AM Subject: Re: [Linux-HA] Linux HA for Asterisk Thanks David, BTW, is there any document/material available on the Internet for configuring and testing the Asterisk OCF resource? Please advise. I'm sure there is, but I am unfamiliar with the agent other than knowing it exists. If you don't find any help here, perhaps try the asterisk-users mailing list (subscribe here: http://www.asterisk.org/community/discuss) -- Vossel Date: Tue, 19 Mar 2013 10:49:56 -0400 (EDT) From: David Vossel dvos...@redhat.com Subject: Re: [Linux-HA] Linux HA for Asterisk To: General Linux-HA mailing list linux-ha@lists.linux-ha.org Message-ID: 506620979.929845.1363704596090.javamail.r...@redhat.com Content-Type: text/plain; charset=utf-8 - Original Message - From: Ahmed Munir ahmedmunir...@gmail.com To: linux-ha@lists.linux-ha.org Sent: Monday, March 18, 2013 12:31:17 PM Subject: [Linux-HA] Linux HA for Asterisk Hi all, The heartbeat version I installed on two CentOS 5.9 boxes is 3.0.3-2, and it is for Asterisk failover. With the default/standard configuration, if I stop the heartbeat service on either machine the virtual IP is automatically assigned to the other machine, and this setup is working well, but it only works at the system level. What I'm looking for is service-level failover, i.e. if the Asterisk service isn't running on serverA, the IP address is automatically assigned to serverB. Please advise how to accomplish this (service-level failover), as I thought no such OCF resource exists for Asterisk. Have you seen this? https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/asterisk -- Vossel Listing the standard Linux HA configuration below; nodes node id=887bae58-1eb6-47d1-b539-d12a2ed3d836 uname=asterisk1 type=normal/ node id=b966dfa2-5973-4dfc-96ba-b2d38319c174 uname=asterisk2 type=normal/ /nodes resources group id=group_1 primitive class=ocf id=IPaddr_10_3_152_103 provider=heartbeat type=IPaddr operations op id=IPaddr_10_3_152_103_mon interval=5s name=monitor timeout=5s/ /operations instance_attributes id=IPaddr_10_3_152_103_inst_attr attributes nvpair id=IPaddr_10_3_152_103_attr_0 name=ip value=10.3.152.103/ /attributes /instance_attributes /primitive primitive class=lsb id=asterisk_2 provider=heartbeat type=asterisk operations op id=asterisk_2_mon interval=120s name=monitor timeout=60s/ /operations /primitive /group /resources constraints rsc_location id=rsc_location_group_1 rsc=group_1 rule id=preferred_location_group_1 score=100 expression attribute=#uname id=preferred_location_group_1_expr operation=eq value=asterisk1/ /rule /rsc_location /constraints /configuration -- Regards, Ahmed Munir Chohan ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
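For the archive, service-level failover with the OCF agent comes down to a monitored asterisk primitive grouped with the VIP, so a failed monitor moves both together. A minimal sketch in crm shell syntax; resource names and intervals are illustrative, not taken from the thread:

    # Monitor asterisk itself; if the monitor fails, the whole group
    # (including the virtual IP) fails over to the other node.
    crm configure primitive p_asterisk ocf:heartbeat:asterisk \
        op monitor interval=10s timeout=30s
    crm configure primitive p_vip ocf:heartbeat:IPaddr \
        params ip=10.3.152.103 \
        op monitor interval=5s timeout=20s
    crm configure group grp_asterisk p_asterisk p_vip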
Re: [Linux-HA] Master/Slave - Master node not monitored after a failure
- Original Message - From: radurad radu@gmail.com To: linux-ha@lists.linux-ha.org Sent: Monday, February 4, 2013 1:51:49 AM Subject: Re: [Linux-HA] Master/Slave - Master node not monitored after a failure Hi, I've installed from RPMs as it was faster (from source I had to install a lot of devel packages and got stuck at libcpg). The issue is solved: the master is being monitored after any number of failures. But there is a new issue I'm facing now (if I can't get it fixed I'll probably make a new post on the forum, if one hasn't already been created): after a couple of failures and restarts, at the next failure mysql is not started anymore; in the logs I get the message MySQL is not running, but the start/restart doesn't happen (I made sure that the failcount is 0, as I have it reset from time to time). I haven't encountered anything like that. If you can gather the log and pengine cluster data using crm_report we should be able to help figure out what is going on. -- Vossel Thanks again, Radu Rad. David Vossel wrote: - Original Message - From: radurad radu@gmail.com To: linux-ha@lists.linux-ha.org Sent: Wednesday, January 30, 2013 5:10:00 AM Subject: Re: [Linux-HA] Master/Slave - Master node not monitored after a failure Hi, Thank you for clarifying this. On CentOS 6 the latest pacemaker build is 1.1.7 (which I'm using now); do you see a problem with installing from source so that I'll have pacemaker 1.1.8? The only thing I can think of is that you might have to get a new version of libqb in order to use 1.1.8. We already have a RHEL 6 based package you can use if you want. http://clusterlabs.org/rpm-next/ -- Vossel Best Regards, Radu Rad. David Vossel wrote: - Original Message - From: radurad radu@gmail.com To: linux-ha@lists.linux-ha.org Sent: Thursday, January 24, 2013 6:07:38 AM Subject: [Linux-HA] Master/Slave - Master node not monitored after a failure Hi, Using the following installation under CentOS corosync-1.4.1-7.el6_3.1.x86_64 resource-agents-3.9.2-12.el6.x86_64 and having the following configuration for a Master/Slave mysql primitive mysqld ocf:heartbeat:mysql \ params binary=/usr/bin/mysqld_safe config=/etc/my.cnf socket=/var/lib/mysql/mysql.sock datadir=/var/lib/mysql user=mysql replication_user=root replication_passwd=testtest \ op monitor interval=5s role=Slave timeout=31s \ op monitor interval=6s role=Master timeout=30s ms ms_mysql mysqld \ meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true property $id=cib-bootstrap-options \ dc-version=1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14 \ cluster-infrastructure=openais \ expected-quorum-votes=2 \ no-quorum-policy=ignore \ stonith-enabled=false \ last-lrm-refresh=1359026356 \ start-failure-is-fatal=false \ cluster-recheck-interval=60s rsc_defaults $id=rsc-options \ failure-timeout=50s Having only one node online (the Master; with a slave online the problem also occurs, but for simplicity I've left only the Master online) I run into the problem below: - Stopping the mysql process once results in corosync restarting mysql and promoting it to Master. - Stopping the mysql process again results in nothing; the failure is not detected, corosync takes no action and still sees the node as Master and mysql as running. - The monitor operation is not running after the first failure, as there are no entries in the log of the type: INFO: MySQL monitor succeeded (master). - Changing something in the configuration results in corosync immediately detecting that mysql is not running and promoting it. Also, the monitor operation will run until the first failure, at which point the same problem occurs. If you need more information let me know. I could also attach the log from the messages files. Hey, this is a known bug and has been resolved in pacemaker 1.1.8. Here's the related issue; the commits are listed in the comments. http://bugs.clusterlabs.org/show_bug.cgi?id=5072 -- Vossel Thanks for now, Radu. -- View this message in context: http://old.nabble.com/Master-Slave---Master-node-not-monitored-after-a-failure-tp34939865p34939865.html Sent from the Linux-HA mailing list archive at Nabble.com. ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
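Gathering the data Vossel asks for is a single command. A sketch with a hypothetical time window; the window should bracket the failure being investigated, and option spelling can vary slightly between crm_report versions:

    # Bundle logs, CIB history and pengine inputs for the given period
    # into /tmp/mysql-failure.tar.bz2 for the list to inspect.
    crm_report -f "2013-02-04 01:00" -t "2013-02-04 03:00" /tmp/mysql-failure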
Re: [Linux-HA] Some help on understanding how HA issues are addressed by pacemaker
- Original Message - From: Hermes Flying flyingher...@yahoo.com To: linux-ha@lists.linux-ha.org Sent: Friday, November 30, 2012 4:04:34 PM Subject: [Linux-HA] Some help on understanding how HA issues are addressed by pacemaker Hi, I am looking into using your facilities to have high availability on my system. I am trying to figure out some things and I hope you guys can help me. I am interested in knowing how pacemaker migrates a VIP and how a split-brain situation is addressed by your facilities. To be specific, I am interested in the following setup: two Linux machines, each running a load balancer and a Tomcat instance. If I understand correctly, pacemaker will be responsible for assigning the main VIP to one of the nodes. My questions are: 1) Will pacemaker monitor/restart the load balancers on each machine in case of a crash? 2) How does pacemaker decide to migrate the VIP to the other node? 3) Do the pacemakers on each machine communicate? If yes, how do you handle network failure? Could I end up with split-brain? 4) Generally, how is split-brain addressed using pacemaker? 5) Could pacemaker monitor Tomcat? As you can see, I am interested in maintaining quorum in a two-node configuration. If you can help me with this info to find a proper direction it would be much appreciated! Thank you Hey, You may or may not have looked at this already, but this is a good place to start: http://clusterlabs.org/doc/ Read chapter one of this document: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html Running through this 2-node cluster exercise will likely answer many of your questions: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Clusters_from_Scratch/index.html -- Vossel ___ Linux-HA mailing list Linux-HA@lists.linux-ha.org http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
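As a pointer on the two-node quorum question, the common pattern on this vintage of pacemaker was to relax quorum (two nodes can never out-vote each other) and rely on fencing to resolve split-brain. A hedged sketch in crm shell syntax; the fencing device and its parameters are hypothetical and must match the real hardware:

    # Two-node cluster: ignore loss of quorum, but keep fencing enabled so a
    # partitioned node is powered off instead of both nodes claiming the VIP.
    crm configure property no-quorum-policy=ignore
    crm configure property stonith-enabled=true
    # Example fencing device (hypothetical IPMI parameters):
    crm configure primitive fence-node1 stonith:external/ipmi \
        params hostname=node1 ipaddr=192.0.2.11 userid=admin passwd=secret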