Re: [ClusterLabs] Pacemaker crash and fencing failure
On 21.11.2015 03:38, Brian Campbell wrote:
> What I'm concerned about is the initial failure of crmd on master1 that led to master2 deciding to fence it, and then master2's failure to fence master1 and thus getting stuck and not being able to manage resources. It seems to have simply stopped doing anything, with no logs indicating why it did so.

That's actually normal. If fencing is required but cannot be performed, the cluster is stuck - no further actions can be completed in that state. So the root cause here seems to be the unsuccessful fencing.
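A rough sketch of how the fencing path can be verified outside of a failure, using the standard stonith_admin tool (the node name is just taken from this thread, and exact option spellings can differ slightly between Pacemaker versions):

    # Which stonith devices does the cluster believe can fence this node?
    stonith_admin --list es-efs-master1

    # Trigger an actual fencing action against it (disruptive: the node will reboot)
    stonith_admin --reboot es-efs-master1

    # Review the recorded fencing history for that target afterwards
    stonith_admin --history es-efs-master1

Whatever the fence device is (IPMI, PDU, etc.), it has to keep working even when the target host is completely unresponsive, which is exactly the case that failed here.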
[ClusterLabs] Pacemaker crash and fencing failure
I've been trying to debug and do a root cause analysis for a cascading series of failures that a customer hit a couple of days ago, which caused their filesystem to be unavailable for a couple of hours.

The original failure was in our own distributed filesystem backend, a fork of LizardFS, which is in turn a fork of MooseFS. This history is mostly only important in reading the logs, where "efs", "lizardfs", and "mfs" all generally refer to the same services, just different generations of the naming, as not all daemons, scripts, and packages have been renamed. There are two master servers that handle metadata operations, running Pacemaker to elect which one is the current primary and which one is a replica, and a number of chunkservers that store file chunks and simply connect to the current running master via a virtual IP.

A bug in doing a checksum scan on the chunkservers caused them to leak file descriptors and become unresponsive, so while the master server was up and healthy, no actual filesystem operations could occur. (This bug is now fixed, by the way, and the fix deployed to the customer, but we want to debug why the later failures occurred that caused them to continue to have downtime.)

The customer saw that things were unresponsive and tried the simplest thing they could to resolve it: migrating the services to the other master. This succeeded, as the checksum scan had been initiated by the first master, so switching over to the replica caused all of the extra file descriptors to be closed and the chunkservers to become responsive again. However, due to one backup service that is not yet managed via Pacemaker and thus is only running on the first master, they decided to migrate back to the first master. This was when they ran into a Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was originally the master when the first problem happened, and which they are trying to migrate the services back to. es-efs-master2 is the one actively running the services, and it also happens to be the DC at the time, so that's where to look for pengine messages.
(By the way, apologies for the long message with large log excerpts; I was trying to balance enough detail with not overwhelming, and it can be hard to keep it short when explaining these kinds of complicated failures across a number of machines.)

On master2, you can see the point when the user tried to migrate back to master1 based on the pengine decisions:

Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481601: /var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2): Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-m
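For what it's worth, a transition like 1481601 can be replayed offline from the pe-input file named in the log; that is usually the quickest way to see why the policy engine chose those actions. A minimal sketch with the standard crm_simulate tool (not part of the original report):

    # Copy /var/lib/pacemaker/pengine/pe-input-1355.bz2 off the DC, then replay it:
    crm_simulate --simulate --xml-file pe-input-1355.bz2

    # Add allocation scores to see why each placement and ordering decision was made:
    crm_simulate --simulate --show-scores --xml-file pe-input-1355.bz2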
Re: [ClusterLabs] start service after filesystemressource
On 11/20/2015 07:38 AM, haseni...@gmx.de wrote:
> Hi,
> I want to start several services after the DRBD resource and the filesystem are
> available. This is my current configuration:
> node $id="184548773" host-1 \
>     attributes standby="on"
> node $id="184548774" host-2 \
>     attributes standby="on"
> primitive collectd lsb:collectd \
>     op monitor interval="10" timeout="30" \
>     op start interval="0" timeout="120" \
>     op stop interval="0" timeout="120"
> primitive failover-ip1 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip2 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip3 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive res_drbd_export ocf:linbit:drbd \
>     params drbd_resource="hermes"
> primitive res_fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
> ms ms_drbd_export res_drbd_export \
>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> location cli-prefer-collectd collectd inf: host-1
> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
> location cli-prefer-res_fs res_fs inf: host-1

A word of warning: these "cli-" constraints were added automatically when you ran CLI commands to move resources to specific hosts. You have to clear these when you're done with whatever the move was for, otherwise the resources will only run on those nodes from now on. If you're using pcs, "pcs resource clear <resource>" will do it.

> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1447686090"
> #vim:set syntax=pcmk
> I haven't found the right way to order the startup of new services (for example
> collectd) after /mnt is mounted. Can you help me?

As other posters mentioned, order constraints and/or groups will do that. Exact syntax depends on what CLI tools you use; check their man pages for details.
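As a sketch only, against the configuration quoted above (shown in both pcs and crmsh; the resource and constraint names are the ones from the post):

    # pcs: drop the leftover move constraint for each affected resource
    pcs resource clear collectd
    pcs resource clear failover-ip1

    # crmsh: the same, either by un-doing the move or deleting the constraint directly
    crm resource unmove collectd
    crm configure delete cli-prefer-collectd

    # One way to make the services wait for the mounted filesystem:
    # order the whole group (which already contains collectd) after res_fs
    crm configure order o_fs_before_services inf: res_fs:start mygroup:start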
Re: [ClusterLabs] start service after filesystemressource
On 20.11.2015 17:53, emmanuel segura wrote:
> using group is more simple
> example:
> group mygroup resource1 resource2 resource3

A group implies ordering and dependencies between its members, and I do not know if that is correct here. Using a resource set in the order statement is another possibility.

> order o_drbd_before_services inf: ms_drbd_export:promote mygroup:start
>
> 2015-11-20 15:45 GMT+01:00 Andrei Borzenkov :
>> On 20.11.2015 16:38, haseni...@gmx.de wrote:
>>> Hi,
>>> I want to start several services after the DRBD resource and the filesystem are
>>> available. This is my current configuration:
>>> node $id="184548773" host-1 \
>>>     attributes standby="on"
>>> node $id="184548774" host-2 \
>>>     attributes standby="on"
>>> primitive collectd lsb:collectd \
>>>     op monitor interval="10" timeout="30" \
>>>     op start interval="0" timeout="120" \
>>>     op stop interval="0" timeout="120"
>>> primitive failover-ip1 ocf:heartbeat:IPaddr \
>>>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>>>     op monitor interval="10s"
>>> primitive failover-ip2 ocf:heartbeat:IPaddr \
>>>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>>>     op monitor interval="10s"
>>> primitive failover-ip3 ocf:heartbeat:IPaddr \
>>>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>>>     op monitor interval="10s"
>>> primitive res_drbd_export ocf:linbit:drbd \
>>>     params drbd_resource="hermes"
>>> primitive res_fs ocf:heartbeat:Filesystem \
>>>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
>>> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
>>> ms ms_drbd_export res_drbd_export \
>>>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
>>> location cli-prefer-collectd collectd inf: host-1
>>> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
>>> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
>>> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
>>> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
>>> location cli-prefer-res_fs res_fs inf: host-1
>>> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
>>> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.1.10-42f2063" \
>>>     cluster-infrastructure="corosync" \
>>>     stonith-enabled="false" \
>>>     no-quorum-policy="ignore" \
>>>     last-lrm-refresh="1447686090"
>>> #vim:set syntax=pcmk
>>> I haven't found the right way to order the startup of new services (for example
>>> collectd) after /mnt is mounted.
>>
>> Just order them after res_fs, same as you order res_fs after ms_drbd_export.
>> Or maybe I misunderstand your question?
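As a sketch of the resource-set variant in crmsh syntax (using the resource names from the posted configuration; untested, so treat it as an illustration rather than a drop-in):

    # One ordered constraint chaining promote -> mount -> services
    order o_drbd_fs_services inf: ms_drbd_export:promote res_fs:start mygroup:start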
Re: [ClusterLabs] start service after filesystemressource
using group is more simple

example:

group mygroup resource1 resource2 resource3

order o_drbd_before_services inf: ms_drbd_export:promote mygroup:start

2015-11-20 15:45 GMT+01:00 Andrei Borzenkov :
> On 20.11.2015 16:38, haseni...@gmx.de wrote:
>
>> Hi,
>> I want to start several services after the DRBD resource and the filesystem are
>> available. This is my current configuration:
>> node $id="184548773" host-1 \
>>     attributes standby="on"
>> node $id="184548774" host-2 \
>>     attributes standby="on"
>> primitive collectd lsb:collectd \
>>     op monitor interval="10" timeout="30" \
>>     op start interval="0" timeout="120" \
>>     op stop interval="0" timeout="120"
>> primitive failover-ip1 ocf:heartbeat:IPaddr \
>>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>>     op monitor interval="10s"
>> primitive failover-ip2 ocf:heartbeat:IPaddr \
>>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>>     op monitor interval="10s"
>> primitive failover-ip3 ocf:heartbeat:IPaddr \
>>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>>     op monitor interval="10s"
>> primitive res_drbd_export ocf:linbit:drbd \
>>     params drbd_resource="hermes"
>> primitive res_fs ocf:heartbeat:Filesystem \
>>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
>> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
>> ms ms_drbd_export res_drbd_export \
>>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
>> location cli-prefer-collectd collectd inf: host-1
>> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
>> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
>> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
>> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
>> location cli-prefer-res_fs res_fs inf: host-1
>> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
>> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.10-42f2063" \
>>     cluster-infrastructure="corosync" \
>>     stonith-enabled="false" \
>>     no-quorum-policy="ignore" \
>>     last-lrm-refresh="1447686090"
>> #vim:set syntax=pcmk
>> I don't found the right way, to order the startup of new services (example
>> collectd), after the /mnt is mounted.
>
> Just order them after res_fs, same as you order res_fs after ms_drbd_export.
> Or may be I misunderstand your question?

--
 .~.
 /V\
// \\
/( )\
^`~'^
Re: [ClusterLabs] start service after filesystemressource
On 20.11.2015 16:38, haseni...@gmx.de wrote:
> Hi,
> I want to start several services after the DRBD resource and the filesystem are
> available. This is my current configuration:
> node $id="184548773" host-1 \
>     attributes standby="on"
> node $id="184548774" host-2 \
>     attributes standby="on"
> primitive collectd lsb:collectd \
>     op monitor interval="10" timeout="30" \
>     op start interval="0" timeout="120" \
>     op stop interval="0" timeout="120"
> primitive failover-ip1 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip2 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip3 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive res_drbd_export ocf:linbit:drbd \
>     params drbd_resource="hermes"
> primitive res_fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
> ms ms_drbd_export res_drbd_export \
>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> location cli-prefer-collectd collectd inf: host-1
> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
> location cli-prefer-res_fs res_fs inf: host-1
> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1447686090"
> #vim:set syntax=pcmk
> I haven't found the right way to order the startup of new services (for example
> collectd) after /mnt is mounted.

Just order them after res_fs, the same way you order res_fs after ms_drbd_export. Or maybe I misunderstand your question?
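In crmsh syntax that could look something like the sketch below (the constraint names are made up; and since collectd is already a member of mygroup, ordering the whole group is an alternative):

    # Start collectd only after the filesystem on /mnt is mounted
    order o_fs_before_collectd inf: res_fs:start collectd:start

    # Or order the whole service group after the mount
    order o_fs_before_group inf: res_fs:start mygroup:start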
[ClusterLabs] start service after filesystemressource
Hi,
I want to start several services after the DRBD resource and the filesystem are available. This is my current configuration:

node $id="184548773" host-1 \
    attributes standby="on"
node $id="184548774" host-2 \
    attributes standby="on"
primitive collectd lsb:collectd \
    op monitor interval="10" timeout="30" \
    op start interval="0" timeout="120" \
    op stop interval="0" timeout="120"
primitive failover-ip1 ocf:heartbeat:IPaddr \
    params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
    op monitor interval="10s"
primitive failover-ip2 ocf:heartbeat:IPaddr \
    params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
    op monitor interval="10s"
primitive failover-ip3 ocf:heartbeat:IPaddr \
    params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
    op monitor interval="10s"
primitive res_drbd_export ocf:linbit:drbd \
    params drbd_resource="hermes"
primitive res_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/mnt" fstype="ext4"
group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
ms ms_drbd_export res_drbd_export \
    meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
location cli-prefer-collectd collectd inf: host-1
location cli-prefer-failover-ip1 failover-ip1 inf: host-1
location cli-prefer-failover-ip2 failover-ip2 inf: host-1
location cli-prefer-failover-ip3 failover-ip3 inf: host-1
location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
location cli-prefer-res_fs res_fs inf: host-1
colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1447686090"
#vim:set syntax=pcmk

I haven't found the right way to order the startup of new services (for example collectd) after /mnt is mounted. Can you help me?