Re: [ClusterLabs] Pacemaker crash and fencing failure
On 21.11.2015 03:38, Brian Campbell wrote:
> What I'm concerned about is the initial failure of crmd on master1 that led to master2 deciding to fence it, and then master2's failure to fence master1 and thus getting stuck and not being able to manage resources. It seems to have simply stopped doing anything, with no logs indicating why it did so.

That's actually normal. If fencing is required but cannot be performed, the cluster is stuck - no further actions can be completed in that state. So the root cause here seems to be the unsuccessful fencing.
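A rough sketch of how the fencing path can be verified outside of a failure, using the standard stonith_admin tool (the node name is just taken from this thread, and exact option spellings can differ slightly between Pacemaker versions):

    # Which stonith devices does the cluster believe can fence this node?
    stonith_admin --list es-efs-master1

    # Trigger an actual fencing action against it (disruptive: the node will reboot)
    stonith_admin --reboot es-efs-master1

    # Review the recorded fencing history for that target afterwards
    stonith_admin --history es-efs-master1

Whatever the fence device is (IPMI, PDU, etc.), it has to keep working even when the target host is completely unresponsive, which is exactly the case that failed here.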
[ClusterLabs] Pacemaker crash and fencing failure
I've been trying to debug and do a root cause analysis for a cascading series of failures that a customer hit a couple of days ago, which caused their filesystem to be unavailable for a couple of hours.

The original failure was in our own distributed filesystem backend, a fork of LizardFS, which is in turn a fork of MooseFS. This history is mostly only important in reading the logs, where "efs", "lizardfs", and "mfs" all generally refer to the same services, just different generations of the naming, as not all daemons, scripts, and packages have been renamed. There are two master servers that handle metadata operations, running Pacemaker to elect which one is the current primary and which one is a replica, and a number of chunkservers that store file chunks and simply connect to the current running master via a virtual IP.

A bug in doing a checksum scan on the chunkservers caused them to leak file descriptors and become unresponsive, so while the master server was up and healthy, no actual filesystem operations could occur. (This bug is now fixed, by the way, and the fix deployed to the customer, but we want to debug why the later failures occurred that caused them to continue to have downtime.)

The customer saw that things were unresponsive and tried the simplest thing they could to resolve it: migrating the services to the other master. This succeeded, as the checksum scan had been initiated by the first master, so switching over to the replica caused all of the extra file descriptors to be closed and the chunkservers to become responsive again. However, due to one backup service that is not yet managed via Pacemaker and thus is only running on the first master, they decided to migrate back to the first master. This was when they ran into a Pacemaker problem.

At the time of the problem, es-efs-master1 is the server that was originally the master when the first problem happened, and which they are trying to migrate the services back to. es-efs-master2 is the one actively running the services, and it also happens to be the DC at the time, so that's where to look for pengine messages.
(By the way, apologies for the long message with large log excerpts; I was trying to balance enough detail with not overwhelming, and it can be hard to keep it short when explaining these kinds of complicated failures across a number of machines.)

On master2, you can see the point when the user tried to migrate back to master1 based on the pengine decisions:

Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-master2)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: process_pe_message: Calculated Transition 1481601: /var/lib/pacemaker/pengine/pe-input-1355.bz2
Nov 18 08:28:28 es-efs-master2 stonith-ng[1920]: warning: cib_process_diff: Diff 0.2754083.1 -> 0.2754083.2 from local not applied to 0.2754083.1: Failed application of an update diff
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: process_lrm_event: LRM operation editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive_notify_0 (call=400, rc=0, cib-update=0, confirmed=true) ok
Nov 18 08:28:28 es-efs-master2 crmd[1924]: notice: run_graph: Transition 1481601 (Complete=5, Pending=0, Fired=0, Skipped=15, Incomplete=10, Source=/var/lib/pacemaker/pengine/pe-input-1355.bz2): Stopped
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Processing failed op demote for editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 on es-efs-master2: unknown error (1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: warning: unpack_rsc_op: Forcing editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1 to stop after a failed demote action
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Move editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.ip#011(Started es-efs-master2 -> es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Promote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:0#011(Slave -> Master es-efs-master1)
Nov 18 08:28:28 es-efs-master2 pengine[1923]: notice: LogActions: Demote editshare.stack.7c645b0e-46bb-407e-b48a-92ec3121f2d7.lizardfs-master.primitive:1#011(Master -> Slave es-efs-m
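For what it's worth, a transition like 1481601 can be replayed offline from the pe-input file named in the log; that is usually the quickest way to see why the policy engine chose those actions. A minimal sketch with the standard crm_simulate tool (not part of the original report):

    # Copy /var/lib/pacemaker/pengine/pe-input-1355.bz2 off the DC, then replay it:
    crm_simulate --simulate --xml-file pe-input-1355.bz2

    # Add allocation scores to see why each placement and ordering decision was made:
    crm_simulate --simulate --show-scores --xml-file pe-input-1355.bz2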
Re: [ClusterLabs] start service after filesystemressource
On 11/20/2015 07:38 AM, haseni...@gmx.de wrote:
> Hi,
> I want to start several services after the DRBD resource and the filesystem are
> available. This is my current configuration:
> node $id="184548773" host-1 \
>     attributes standby="on"
> node $id="184548774" host-2 \
>     attributes standby="on"
> primitive collectd lsb:collectd \
>     op monitor interval="10" timeout="30" \
>     op start interval="0" timeout="120" \
>     op stop interval="0" timeout="120"
> primitive failover-ip1 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip2 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip3 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive res_drbd_export ocf:linbit:drbd \
>     params drbd_resource="hermes"
> primitive res_fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
> ms ms_drbd_export res_drbd_export \
>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> location cli-prefer-collectd collectd inf: host-1
> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
> location cli-prefer-res_fs res_fs inf: host-1

A word of warning: these "cli-" constraints were added automatically when you ran CLI commands to move resources to specific hosts. You have to clear these when you're done with whatever the move was for, otherwise the resources will only run on those nodes from now on. If you're using pcs, "pcs resource clear <resource>" will do it.

> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1447686090"
> #vim:set syntax=pcmk
> I haven't found the right way to order the startup of new services (for example
> collectd) after /mnt is mounted. Can you help me?

As other posters mentioned, order constraints and/or groups will do that. Exact syntax depends on what CLI tools you use; check their man pages for details.
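As a sketch only, against the configuration quoted above (shown in both pcs and crmsh; the resource and constraint names are the ones from the post):

    # pcs: drop the leftover move constraint for each affected resource
    pcs resource clear collectd
    pcs resource clear failover-ip1

    # crmsh: the same, either by un-doing the move or deleting the constraint directly
    crm resource unmove collectd
    crm configure delete cli-prefer-collectd

    # One way to make the services wait for the mounted filesystem:
    # order the whole group (which already contains collectd) after res_fs
    crm configure order o_fs_before_services inf: res_fs:start mygroup:start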
Re: [ClusterLabs] start service after filesystemressource
On 20.11.2015 17:53, emmanuel segura wrote:
> using group is more simple
> example:
> group mygroup resource1 resource2 resource3

A group implies ordering and dependencies between its members, and I do not know if that is correct here. Using a resource set in the order statement is another possibility.

> order o_drbd_before_services inf: ms_drbd_export:promote mygroup:start
>
> 2015-11-20 15:45 GMT+01:00 Andrei Borzenkov :
>> On 20.11.2015 16:38, haseni...@gmx.de wrote:
>>> Hi,
>>> I want to start several services after the DRBD resource and the filesystem are
>>> available. This is my current configuration:
>>> node $id="184548773" host-1 \
>>>     attributes standby="on"
>>> node $id="184548774" host-2 \
>>>     attributes standby="on"
>>> primitive collectd lsb:collectd \
>>>     op monitor interval="10" timeout="30" \
>>>     op start interval="0" timeout="120" \
>>>     op stop interval="0" timeout="120"
>>> primitive failover-ip1 ocf:heartbeat:IPaddr \
>>>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>>>     op monitor interval="10s"
>>> primitive failover-ip2 ocf:heartbeat:IPaddr \
>>>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>>>     op monitor interval="10s"
>>> primitive failover-ip3 ocf:heartbeat:IPaddr \
>>>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>>>     op monitor interval="10s"
>>> primitive res_drbd_export ocf:linbit:drbd \
>>>     params drbd_resource="hermes"
>>> primitive res_fs ocf:heartbeat:Filesystem \
>>>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
>>> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
>>> ms ms_drbd_export res_drbd_export \
>>>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
>>> location cli-prefer-collectd collectd inf: host-1
>>> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
>>> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
>>> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
>>> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
>>> location cli-prefer-res_fs res_fs inf: host-1
>>> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
>>> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
>>> property $id="cib-bootstrap-options" \
>>>     dc-version="1.1.10-42f2063" \
>>>     cluster-infrastructure="corosync" \
>>>     stonith-enabled="false" \
>>>     no-quorum-policy="ignore" \
>>>     last-lrm-refresh="1447686090"
>>> #vim:set syntax=pcmk
>>> I haven't found the right way to order the startup of new services (for example
>>> collectd) after /mnt is mounted.
>>
>> Just order them after res_fs, same as you order res_fs after ms_drbd_export.
>> Or maybe I misunderstand your question?
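As a sketch of the resource-set variant in crmsh syntax (using the resource names from the posted configuration; untested, so treat it as an illustration rather than a drop-in):

    # One ordered constraint chaining promote -> mount -> services
    order o_drbd_fs_services inf: ms_drbd_export:promote res_fs:start mygroup:start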
Re: [ClusterLabs] start service after filesystemressource
using group is more simple

example:

group mygroup resource1 resource2 resource3

order o_drbd_before_services inf: ms_drbd_export:promote mygroup:start

2015-11-20 15:45 GMT+01:00 Andrei Borzenkov :
> On 20.11.2015 16:38, haseni...@gmx.de wrote:
>
>> Hi,
>> I want to start several services after the DRBD resource and the filesystem are
>> available. This is my current configuration:
>> node $id="184548773" host-1 \
>>     attributes standby="on"
>> node $id="184548774" host-2 \
>>     attributes standby="on"
>> primitive collectd lsb:collectd \
>>     op monitor interval="10" timeout="30" \
>>     op start interval="0" timeout="120" \
>>     op stop interval="0" timeout="120"
>> primitive failover-ip1 ocf:heartbeat:IPaddr \
>>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>>     op monitor interval="10s"
>> primitive failover-ip2 ocf:heartbeat:IPaddr \
>>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>>     op monitor interval="10s"
>> primitive failover-ip3 ocf:heartbeat:IPaddr \
>>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>>     op monitor interval="10s"
>> primitive res_drbd_export ocf:linbit:drbd \
>>     params drbd_resource="hermes"
>> primitive res_fs ocf:heartbeat:Filesystem \
>>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
>> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
>> ms ms_drbd_export res_drbd_export \
>>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
>> location cli-prefer-collectd collectd inf: host-1
>> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
>> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
>> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
>> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
>> location cli-prefer-res_fs res_fs inf: host-1
>> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
>> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
>> property $id="cib-bootstrap-options" \
>>     dc-version="1.1.10-42f2063" \
>>     cluster-infrastructure="corosync" \
>>     stonith-enabled="false" \
>>     no-quorum-policy="ignore" \
>>     last-lrm-refresh="1447686090"
>> #vim:set syntax=pcmk
>> I don't found the right way, to order the startup of new services (example
>> collectd), after the /mnt is mounted.
>
> Just order them after res_fs, same as you order res_fs after ms_drbd_export.
> Or may be I misunderstand your question?

--
 .~.
 /V\
// \\
/( )\
^`~'^
Re: [ClusterLabs] start service after filesystemressource
On 20.11.2015 16:38, haseni...@gmx.de wrote:
> Hi,
> I want to start several services after the DRBD resource and the filesystem are
> available. This is my current configuration:
> node $id="184548773" host-1 \
>     attributes standby="on"
> node $id="184548774" host-2 \
>     attributes standby="on"
> primitive collectd lsb:collectd \
>     op monitor interval="10" timeout="30" \
>     op start interval="0" timeout="120" \
>     op stop interval="0" timeout="120"
> primitive failover-ip1 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip2 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive failover-ip3 ocf:heartbeat:IPaddr \
>     params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
>     op monitor interval="10s"
> primitive res_drbd_export ocf:linbit:drbd \
>     params drbd_resource="hermes"
> primitive res_fs ocf:heartbeat:Filesystem \
>     params device="/dev/drbd0" directory="/mnt" fstype="ext4"
> group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
> ms ms_drbd_export res_drbd_export \
>     meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> location cli-prefer-collectd collectd inf: host-1
> location cli-prefer-failover-ip1 failover-ip1 inf: host-1
> location cli-prefer-failover-ip2 failover-ip2 inf: host-1
> location cli-prefer-failover-ip3 failover-ip3 inf: host-1
> location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
> location cli-prefer-res_fs res_fs inf: host-1
> colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
> order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.10-42f2063" \
>     cluster-infrastructure="corosync" \
>     stonith-enabled="false" \
>     no-quorum-policy="ignore" \
>     last-lrm-refresh="1447686090"
> #vim:set syntax=pcmk
> I haven't found the right way to order the startup of new services (for example
> collectd) after /mnt is mounted.

Just order them after res_fs, the same way you order res_fs after ms_drbd_export. Or maybe I misunderstand your question?
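In crmsh syntax that could look something like the sketch below (the constraint names are made up; and since collectd is already a member of mygroup, ordering the whole group is an alternative):

    # Start collectd only after the filesystem on /mnt is mounted
    order o_fs_before_collectd inf: res_fs:start collectd:start

    # Or order the whole service group after the mount
    order o_fs_before_group inf: res_fs:start mygroup:start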
[ClusterLabs] start service after filesystemressource
Hi,
I want to start several services after the DRBD resource and the filesystem are available. This is my current configuration:

node $id="184548773" host-1 \
    attributes standby="on"
node $id="184548774" host-2 \
    attributes standby="on"
primitive collectd lsb:collectd \
    op monitor interval="10" timeout="30" \
    op start interval="0" timeout="120" \
    op stop interval="0" timeout="120"
primitive failover-ip1 ocf:heartbeat:IPaddr \
    params ip="192.168.6.6" nic="eth0:0" cidr_netmask="32" \
    op monitor interval="10s"
primitive failover-ip2 ocf:heartbeat:IPaddr \
    params ip="192.168.6.7" nic="eth0:1" cidr_netmask="32" \
    op monitor interval="10s"
primitive failover-ip3 ocf:heartbeat:IPaddr \
    params ip="192.168.6.8" nic="eth0:2" cidr_netmask="32" \
    op monitor interval="10s"
primitive res_drbd_export ocf:linbit:drbd \
    params drbd_resource="hermes"
primitive res_fs ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/mnt" fstype="ext4"
group mygroup failover-ip1 failover-ip2 failover-ip3 collectd
ms ms_drbd_export res_drbd_export \
    meta notify="true" master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
location cli-prefer-collectd collectd inf: host-1
location cli-prefer-failover-ip1 failover-ip1 inf: host-1
location cli-prefer-failover-ip2 failover-ip2 inf: host-1
location cli-prefer-failover-ip3 failover-ip3 inf: host-1
location cli-prefer-res_drbd_export res_drbd_export inf: hermes-1
location cli-prefer-res_fs res_fs inf: host-1
colocation c_export_on_drbd inf: mygroup res_fs ms_drbd_export:Master
order o_drbd_before_services inf: ms_drbd_export:promote res_fs:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore" \
    last-lrm-refresh="1447686090"
#vim:set syntax=pcmk

I haven't found the right way to order the startup of new services (for example collectd) after /mnt is mounted. Can you help me?