On Wed, May 03, 2023 at 11:07:20AM -0400, Madison Kelly wrote:
> On 2023-05-03 05:26, 黃暄皓 wrote:
>
> As the title says, is it still in maintenance?
>
> I'm not sure who even owns or maintains that old domain.
We (Linbit) did host that site still,
though all of it was supposed to be "read-only
On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote:
> On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> > Scenario:
> > three nodes, no fencing (I know)
> > break network, isolating nodes
> > unbreak network, see how cluster partitions rejoin and resum
Scenario:
three nodes, no fencing (I know)
break network, isolating nodes
unbreak network, see how cluster partitions rejoin and resume service
Funny outcome:
/usr/sbin/crm_mon -x pe-input-689.bz2
Cluster Summary:
* Stack: corosync
* Current DC: mqhavm24 (version 1.1.24.linbit-2.0.el7-8f22
lge> > I would have expected corosync to come back with a "stable
lge> > non-quorate membership" of just itself within a very short
lge> > period of time, and pacemaker winning the
lge> > "election"/"integration" with just itself, and then trying
lge> > to call "stop" on everything it knows about.
pcmk 2.0.5, corosync 3.1.0, knet, rhel8
I know fencing "solves" this just fine.
what I'd like to understand though is: what exactly is corosync or
pacemaker waiting for here,
why does it not manage to get to the stage where it would even attempt
to "stop" stuff?
two "rings" aka knet interfaces.
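For the "break network" step, something along these lines could be used on each node (a rough sketch, not from the original post; PEER1/PEER2 stand for the other nodes' addresses on both knet links):

  for peer in "$PEER1" "$PEER2"; do
      iptables -I INPUT  -s "$peer" -j DROP
      iptables -I OUTPUT -d "$peer" -j DROP
  done
  sleep 120   # long enough for all partitions to form and react
  for peer in "$PEER1" "$PEER2"; do
      iptables -D INPUT  -s "$peer" -j DROP
      iptables -D OUTPUT -d "$peer" -j DROP
  done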
On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > crmd on the DC was doing an error exit and was respawned:
> >
Now with "reproducer" ... see below
On Thu, Sep 10, 2020 at 11:55:20AM +0200, Lars Ellenberg wrote:
> Hi there.
>
> I've seen a scenario where a network "hiccup" isolated the current DC in a 3
> node cluster for a short time; other partition elected a new DC
wipefs to just remove the DRBD-
specific "magic" used by libblkid to identify it.
The "without Data Loss" part depends on whether the local copy was
"Consistent" (or better yet: UpToDate) before you decided to remove DRBD.
--
: Lars Elle
Hi there.
I've seen a scenario where a network "hiccup" isolated the current DC in a 3
node cluster for a short time; other partition elected a new DC obviously, and
all node attributes of the former DC are "cleared" together with the rest of
its state.
All nodes rejoin, "all happy again", BUT ..
"dc-deadtime", documented as
"How long to wait for a response from other nodes during startup.",
but the 20s default of that in current Pacemaker is most likely
shorter than what you had as initdead in your "old" setup.
So maybe if you set dc-deadtime to two minutes or somet
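As a crmsh sketch (the two-minute figure is just the value suggested above):

  crm configure property dc-deadtime=2min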
able to follow
what it tries to do, and even why.
Other implementations of drbd fencing policy handlers may directly
escalate to node level fencing. If that is what you want, use one of
those, and effectively map every DRBD replication link hiccup to a hard
reset of the peer.
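For orientation, the resource-level variant is usually wired up roughly like this in the DRBD configuration (DRBD 8.4 style; treat option placement and handler paths as assumptions to check against drbd.conf(5)):

  resource r0 {
      disk {
          fencing resource-only;
      }
      handlers {
          fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      ...
  }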
--
: Lars Ellenberg
upported
> (rc=-93, origin=kpasterisk01-ha/crmd/###, version=0.45.44)
> Jun 20 13:09:45 kpasterisk02 crmd:error: finalize_sync_callback:
> Sync from kpasterisk01-ha failed: Protocol not supported
> Jun 20 13:09:45 kpasterisk02 crmd: warning: do_log: FSA: In
On Fri, Jan 19, 2018 at 04:52:40PM -0600, Ken Gaillot wrote:
> Your constraints are:
>
> place IP then place drbd instance(s) with it
> start IP then start drbd instance(s)
>
> place drbd master then place fs with it
> promote drbd master then start fs
>
> I'm guessing you meant to colo
To understand some weird behavior we observed,
I dumbed down a production config to three dummy resources,
while keeping some descriptive resource ids (ip, drbd, fs).
For some reason, the constraints are:
stuff, more stuff, IP -> DRBD -> FS -> other stuff.
(In the actual real-world config, it mak
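Rendered as crmsh commands, that chain would look roughly like this (sketch only; the ms_drbd id for the drbd master/slave set is made up, ip/drbd/fs are the dummied-down ids from the post):

  crm configure colocation drbd-with-ip inf: ms_drbd ip
  crm configure order ip-before-drbd inf: ip:start ms_drbd:start
  crm configure colocation fs-with-drbd-master inf: fs ms_drbd:Master
  crm configure order drbd-promote-before-fs inf: ms_drbd:promote fs:start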
" may be to do things differently.
Maybe just set the cluster wide maintenance mode, not per node?
What are you really trying to do,
what is the reason you need it in maintenance-mode
and stop pacemaker/corosync/openais/the cluster stack,
but do not want to stop/migrate off the resources,
as would b
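A rough sketch of the cluster-wide variant (crmsh syntax; the systemd unit names are assumptions):

  crm configure property maintenance-mode=true
  systemctl stop pacemaker corosync     # resources keep running, unmanaged
  # ... do whatever needs doing ...
  systemctl start corosync pacemaker
  crm configure property maintenance-mode=false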
is broken,
relying on a fall-back workaround
which used to "perform" better.
The bug is not that this fall-back workaround now
has pretty printing and is much slower (and eventually times out),
the bug is that you don't properly kill the service first.
[and that yo
logic of trimtester.
It will report that file as corrupted each time.
rm -rf trimtester-is-broken/
mkdir trimtester-is-broken
o=trimtester-is-broken/x1
echo X > $o
l=$o
for i in `seq 2 32`; do
	o=trimtester-is-broken/x$i
	cat $l $l > $o
	rm -f $l
	l=$o
done
On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
>
>
> - Am 8. Aug 2017 um 15:36 schrieb Lars Ellenberg
> lars.ellenb...@linbit.com:
>
> > crm shell in "auto-commit"?
> > never seen that.
>
> i googled for "crmsh autocommit
t practices how to set up a web server on pacemaker and DRBD"
If you don't have a *very* good reason to use a cluster file
system, for things like web servers, mail servers, file servers,
... most services actually, a "classic" file system such as xfs or
ext4 in failover configuration
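In such a failover configuration the file system is just an ordinary primitive that moves with its peers (illustrative crmsh sketch; all names and paths are made up):

  crm configure primitive fs_web ocf:heartbeat:Filesystem \
      params device=/dev/drbd0 directory=/srv/www fstype=ext4 \
      op monitor interval=20s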
lockfile (#917)
here goes:
On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > Note: ocf_take_lock is NOT actually safe to use.
> >
> > As implemented, it uses "echo $pid > lockfile"
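For comparison, a race-free way to take such a lock could look like this (illustration only, not the ocf_take_lock implementation; the path is made up):

  lockfile=/var/lock/my-agent.lock
  if ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
      trap 'rm -f "$lockfile"' EXIT
      # ... critical section ...
      :
  else
      echo "lock already held by PID $(cat "$lockfile" 2>/dev/null)" >&2
      exit 1
  fi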
Yay!
On Mon, May 08, 2017 at 07:50:49PM -0500, Ken Gaillot wrote:
> "crm_attribute --pattern" to update or delete all node
> attributes matching a regular expression
Just a nit, but "pattern" usually is associated with "glob pattern".
If it's not a "pattern" but a "regex",
"--regex" would be more
ver)
> * is it possible to do the opposite? persistent setting "off" and override
> it
> with the transient setting?
see above, also man crm_standby,
which again is only a wrapper around crm_attribute.
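For reference, the two lifetimes map to two independent attribute values (node name is made up; whether one overrides the other is exactly the question above):

  crm_standby --node node1 --lifetime forever --update off   # permanent value (nodes section)
  crm_standby --node node1 --lifetime reboot  --update on    # transient value (status section)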
--
: Lars Ellenberg
: LINBIT | Keeping the Di
On Tue, Apr 25, 2017 at 10:27:43AM +0200, Jehan-Guillaume de Rorthais wrote:
> On Tue, 25 Apr 2017 10:02:21 +0200
> Lars Ellenberg wrote:
>
> > On Mon, Apr 24, 2017 at 03:08:55PM -0500, Ken Gaillot wrote:
> > > Hi all,
> > >
> > > Pacemaker 1
at the time.
Though that may have been an un-intentional side-effect
of checking both sets of attributes?
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support
DRBD® and LINBIT® are register
On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
>
> In the PostgreSQL Automatic Failover (PAF) project, one of most frequent
> negative feedback we got is how difficult it is to experience with it because
> of
> fencing occurring way too frequently. I am cur
Stop polling the cib several times per second.
If you have to, "subscribe" to cib updates, using the API.
And stop pushing that much data into the cib.
Maybe, as a stop gap, compress it yourself,
before you stuff it into the cib.
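As a sketch of that stop-gap (all names are made up; base64 keeps the blob CIB-safe):

  blob=$(gzip -9c /tmp/big-state.json | base64 -w0)
  crm_attribute --node "$(uname -n)" --name my-state --lifetime reboot --update "$blob"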
--
: Lars Ellenberg
: LINBIT | Keeping the
On Mon, Mar 06, 2017 at 12:35:18PM -0600, Ken Gaillot wrote:
> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
> index 724edb7..39a7dd1 100644
> --- a/lrmd/lrmd.c
> +++ b/lrmd/lrmd.c
> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const
> char *stdout_data)
>
On Thu, Mar 02, 2017 at 05:31:33PM -0600, Ken Gaillot wrote:
> On 03/01/2017 05:28 PM, Andrew Beekhof wrote:
> > On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
> > wrote:
> >> When I recently tried to make use of the DEGRADED monitoring results,
> >> I foun
When I recently tried to make use of the DEGRADED monitoring results,
I found out that it still does not work.
Because LRMD chooses to filter them in ocf2uniform_rc(),
and maps them to PCMK_OCF_UNKNOWN_ERROR.
See patch suggestion below.
It also filters away the other "special" rc values.
Do we re
get the idea.
Currently, we have SBD chosen as such a "watchdog proxy",
maybe we can generalize it?
All of that would require cooperation within the node itself, though.
In this scenario, the cluster is not trusting the "sanity"
of the "commander in chief".
So maybe i
On Tue, Sep 20, 2016 at 09:43:23AM -0500, Ken Gaillot wrote:
> On 09/20/2016 07:38 AM, Lars Ellenberg wrote:
> > From the point of view of the resource agent,
> > you configured it to use a non-existing network.
> > Which it considers to be a configuration error,
> > wh
connectivity, or dealing with removed interface drivers,
or unplugged devices, or whatnot, has to be dealt with elsewhere.
What you did is: down the bond, remove all slave assignments, even
remove the driver, and expect the resource agent to "heal" things that
it does not know about. It can not.
--
addition to the IP.
If you wanted to test-drive cluster response against a
failing network device, your test was wrong.
If you wanted to test-drive cluster response against
a "fat fingered" (or even evil) operator or admin:
give up right there...
You'll never be able to cover it all :-)
gent.
and it was considered not good enough.
That's why we provide the ocf:LINBIT:drbd resource agent.
Which you are supposed to use with DRBD 8.4 on Pacemaker.
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, C
On Wed, Aug 31, 2016 at 12:29:59PM +0200, Dejan Muhamedagic wrote:
> > Also remember that sometimes we set a "local" variable in a function
> > and expect it to be visible in nested functions, but also set a new
> > value in a nested function and expect that value to be reflected
> > in the outer s
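A small illustration of that behaviour (bash/dash dynamic scoping; "local" itself is not specified by POSIX):

  inner() {
      echo "inner: x=$x"   # sees outer's "local" x=1
      x=2                  # modifies the variable that is local to outer
  }
  outer() {
      local x=1
      inner
      echo "outer: x=$x"   # prints x=2
  }
  outer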
On Tue, Aug 30, 2016 at 06:15:49PM +0200, Dejan Muhamedagic wrote:
> On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote:
> > On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> >
> > >The kernel reads the shebang line and it is what defines the
> > >interpreter which is to be invoked to run
On Mon, Aug 29, 2016 at 04:37:00PM +0200, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote:
> > I think the main issue is the usage of the "local" operator in ocf*
> > I'm not an expert on this operator (never used!), don't know how hard it is
> >
nt 10 has type 'ssize_t'
> [-Werror=format=]
That's "just" a format error about ssize_t != int.
See also
https://github.com/ClusterLabs/pacemaker/commit/fc87717
where I already fixed this (and other) format errors.
Of course you could also drop the -Werror,
and hope
But don't.
If you need more than "node-dead" detection,
what you should do for a new system is:
==> use pacemaker on corosync.
Or, if all you are going to manage is a bunch of IP addresses,
maybe you should choose a different tool; VRRP with keepalived
may be better for your need
On Thu, Apr 21, 2016 at 12:50:43PM -0500, Ken Gaillot wrote:
> Hello everybody,
>
> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
>
> The most prominent feature will be Klaus Wenninger's new implementation
> of event-driven alerts -- the ability to call scripts whenever
me:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:500
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
# ip addr add 192.168.7.1/24 dev tap0
# ip addr add 192.168.8.1/24 dev tap0 label tap0:jan
# ip addr show dev tap0
And as long a
On Fri, Mar 25, 2016 at 04:08:48PM +, Sam Gardner wrote:
> On 3/25/16, 10:26 AM, "Lars Ellenberg" wrote:
>
>
> >On Thu, Mar 24, 2016 at 09:01:18PM +, Sam Gardner wrote:
> >> I'm having some trouble on a few of my clusters in which the DRBD Slav
e to have some operation fail.
And you should figure out which, when, and why.
Is it the start that fails?
Why does it fail?
Cheers,
Lars
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support
. After all,
> it does not risk the data, only the automatic cluster recovery, right?
stonith-enabled=false
means:
if some node becomes unresponsive,
it is immediately *assumed* it was "clean" dead.
no fencing takes place,
resource takeover happens without further protection.
That very mu
lab on CentOS 7.2:
> pacemaker-1.1.13-10.el7.x86_64
> corosync-2.3.4-7.el7_2.1.x86_64
>
> Since my latest "yum update", I see the following errors in my logs:
>
> Feb 13 15:22:54 san1.local crmd[1896]:error: pcmkRegisterNode: Triggered
> assert at xml.c:5
gestions. Hoping that perhaps
> someone has successfully done this.
>
> thanks in advance
> -mgb
--
: Lars Ellenberg
: http://www.LINBIT.com
to fulfill target-role, and happened to ignore
master-max, trying to promote all instances everywhere ;-)
not set: default behaviour
started: same as not set
slave: do not promote
master: nowadays for ms resources same as "Started" or not set,
but used to trigger some n
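For reference, such a target-role ends up as resource meta data, e.g. (resource id is made up):

  crm_resource --resource ms_drbd --meta --set-parameter target-role --parameter-value Slave
  crm_resource --resource ms_drbd --meta --delete-parameter target-role   # back to "not set"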
d into 1.1.13)
do not work properly with heartbeat, due to changes in pacemaker
upstream which have been incompatible with the heartbeat messaging and
membership layer (especially when it comes to fencing/stonith).
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD
drbd_make_request+0x531/0x870 [drbd]
> [] ? throtl_find_tg+0x46/0x60
> [] ? blk_throtl_bio+0x1ea/0x5f0
> [] ? blk_queue_bio+0x494/0x610
> [] ? dm_make_request+0x122/0x180 [dm_mod]
> [] generic_make_request+0x240/0x5a0
> [] ? mempool_alloc_slab+0x15/0x20
> [] ? mempool_
On Wed, Oct 07, 2015 at 05:39:01PM +0200, Lars Ellenberg wrote:
> Something like the below, maybe.
> Untested direct-to-email PoC code.
>
> if echo . | grep -q -I . 2>/dev/null; then
> have_grep_dash_I=true
> else
> have_grep_dash_I=false
> fi
> # simila
t maybe someone wants to mmap a huge file,
# and limiting the virtual size cripples mmap unnecessarily,
# so let's limit resident size instead. Let's be generous, when
# decompressing stuff that was compressed with xz -9, we may
# need ~65 MB according
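That idea would look roughly like this (illustration only; the log file variable is a placeholder, and RLIMIT_RSS is not enforced on current Linux kernels, so this documents intent more than it guarantees a cap):

  ( ulimit -m $((100 * 1024)); grep -l -I -e 'Starting Pacemaker' "$logfile" )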
> >>>>> root 27966 23.0 82.9 3248996 1594688 ?  D    12:38   0:08
> >>>>> | \_ grep -l -e Starting Pacemaker
Whoa.
grep using up 1.5 gig resident (3.2 gig virtual) still looking for
the first newline.
I suggest in addition to the (good) suggestions so far,
to a
colon-dash: X="default"
dash-only: X="default"
So, unless you happen to have OCF_RESKEY_syslog_ng_binary explicitly set
to the empty string in your environment, things work just fine.
And if you do, then that's the bug.
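To make the comparison concrete (a small shell check, not from the original post):

  unset V; echo "${V:-default}" "${V-default}"   # -> default default
  V="";    echo "${V:-default}" "${V-default}"   # -> default (and an empty string)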
Which could be worked around by:
> Yes. Interest
w it...
> I deleted the Cluster.conf file and the cib.xml and all the back up versions
> and tried again and got the same error.
> I googled this error and really got nothing. Any ideas?
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA an