On Wed, May 03, 2023 at 11:07:20AM -0400, Madison Kelly wrote:
> On 2023-05-03 05:26, 黃暄皓 wrote:
>
> As the title says, is it still in maintenance?
>
> I'm not sure who even owns or maintains that old domain.
We (Linbit) did host that site still,
though all of it was supposed to be "read-only
On Thu, Sep 08, 2022 at 10:11:46AM -0500, Ken Gaillot wrote:
> On Thu, 2022-09-08 at 15:01 +0200, Lars Ellenberg wrote:
> > Scenario:
> > three nodes, no fencing (I know)
> > break network, isolating nodes
> > unbreak network, see how cluster partitions rejoin and resum
Scenario:
three nodes, no fencing (I know)
break network, isolating nodes
unbreak network, see how cluster partitions rejoin and resume service
Funny outcome:
/usr/sbin/crm_mon -x pe-input-689.bz2
Cluster Summary:
* Stack: corosync
* Current DC: mqhavm24 (version 1.1.24.linbit-2.0.el7-8f22
lge> > I would have expected corosync to come back with a "stable
lge> > non-quorate membership" of just itself within a very short
lge> > period of time, and pacemaker winning the
lge> > "election"/"integration" with just itself, and then trying
lge> > to call "stop" on everything it knows about.
pcmk 2.0.5, corosync 3.1.0, knet, rhel8
I know fencing "solves" this just fine.
what I'd like to understand though is: what exactly is corosync or
pacemaker waiting for here,
why does it not manage to get to the stage where it would even attempt
to "stop" stuff?
two "rings" aka knet interfaces.
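For the "break network" step, something along these lines could be used on each node (a rough sketch, not from the original post; PEER1/PEER2 stand for the other nodes' addresses on both knet links):

  for peer in "$PEER1" "$PEER2"; do
      iptables -I INPUT  -s "$peer" -j DROP
      iptables -I OUTPUT -d "$peer" -j DROP
  done
  sleep 120   # long enough for all partitions to form and react
  for peer in "$PEER1" "$PEER2"; do
      iptables -D INPUT  -s "$peer" -j DROP
      iptables -D OUTPUT -d "$peer" -j DROP
  done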
On Fri, Sep 11, 2020 at 11:42:46AM +0200, Lars Ellenberg wrote:
> On Thu, Sep 10, 2020 at 11:18:58AM -0500, Ken Gaillot wrote:
> > > But for some unrelated reason (stress on the cib, IPC timeout),
> > > crmd on the DC was doing an error exit and was respawned:
> >
Now with "reproducer" ... see below
On Thu, Sep 10, 2020 at 11:55:20AM +0200, Lars Ellenberg wrote:
> Hi there.
>
> I've seen a scenario where a network "hiccup" isolated the current DC in a 3
> node cluster for a short time; other partition elected a new DC
wipefs to just remove the DRBD-
specific "magic" used by libblkid to identify it.
The "without Data Loss" part depends on whether the local copy was
"Consistent" (or better yet: UpToDate) before you decided to remove DRBD.
--
: Lars Elle
Hi there.
I've seen a scenario where a network "hiccup" isolated the current DC in a 3
node cluster for a short time; other partition elected a new DC obviously, and
all node attributes of the former DC are "cleared" together with the rest of
its state.
All nodes rejoin, "all happy again", BUT ..
"dc-deadtime", documented as
"How long to wait for a response from other nodes during startup.",
but the 20s default of that in current Pacemaker is most likely
shorter than what you had as initdead in your "old" setup.
So maybe if you set dc-deadtime to two minutes or somet
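As a crmsh sketch (the two-minute figure is just the value suggested above):

  crm configure property dc-deadtime=2min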
able to follow
what it tries to do, and even why.
Other implementations of drbd fencing policy handlers may directly
escalate to node level fencing. If that is what you want, use one of
those, and effectively map every DRBD replication link hiccup to a hard
reset of the peer.
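For orientation, the resource-level variant is usually wired up roughly like this in the DRBD configuration (DRBD 8.4 style; treat option placement and handler paths as assumptions to check against drbd.conf(5)):

  resource r0 {
      disk {
          fencing resource-only;
      }
      handlers {
          fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      ...
  }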
--
: Lars Ellenberg
upported
> (rc=-93, origin=kpasterisk01-ha/crmd/###, version=0.45.44)
> Jun 20 13:09:45 kpasterisk02 crmd:error: finalize_sync_callback:
> Sync from kpasterisk01-ha failed: Protocol not supported
> Jun 20 13:09:45 kpasterisk02 crmd: warning: do_log: FSA: In
On Fri, Jan 19, 2018 at 04:52:40PM -0600, Ken Gaillot wrote:
> Your constraints are:
>
> place IP then place drbd instance(s) with it
> start IP then start drbd instance(s)
>
> place drbd master then place fs with it
> promote drbd master then start fs
>
> I'm guessing you meant to colo
To understand some weird behavior we observed,
I dumbed down a production config to three dummy resources,
while keeping some descriptive resource ids (ip, drbd, fs).
For some reason, the constraints are:
stuff, more stuff, IP -> DRBD -> FS -> other stuff.
(In the actual real-world config, it mak
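Rendered as crmsh commands, that chain would look roughly like this (sketch only; the ms_drbd id for the drbd master/slave set is made up, ip/drbd/fs are the dummied-down ids from the post):

  crm configure colocation drbd-with-ip inf: ms_drbd ip
  crm configure order ip-before-drbd inf: ip:start ms_drbd:start
  crm configure colocation fs-with-drbd-master inf: fs ms_drbd:Master
  crm configure order drbd-promote-before-fs inf: ms_drbd:promote fs:start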
" may be to do things differently.
Maybe just set the cluster wide maintenance mode, not per node?
What are you really trying to do,
what is the reason you need it in maintenance-mode
and stop pacemaker/corosync/openais/the cluster stack,
but do not want to stop/migrate off the resources,
as would b
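A rough sketch of the cluster-wide variant (crmsh syntax; the systemd unit names are assumptions):

  crm configure property maintenance-mode=true
  systemctl stop pacemaker corosync     # resources keep running, unmanaged
  # ... do whatever needs doing ...
  systemctl start corosync pacemaker
  crm configure property maintenance-mode=false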
is broken,
relying on a fall-back workaround
which used to "perform" better.
The bug is not that this fall-back workaround now
has pretty printing and is much slower (and eventually times out),
the bug is that you don't properly kill the service first.
[and that yo
logic of trimtester.
It will report that file as corrupted each time.
rm -rf trimtester-is-broken/
mkdir trimtester-is-broken
o=trimtester-is-broken/x1
echo X > $o
l=$o
for i in `seq 2 32`; do
	o=trimtester-is-broken/x$i
	cat $l $l > $o
	rm -f $l
	l=$o
done
On Wed, Aug 09, 2017 at 06:48:01PM +0200, Lentes, Bernd wrote:
>
>
> - Am 8. Aug 2017 um 15:36 schrieb Lars Ellenberg
> lars.ellenb...@linbit.com:
>
> > crm shell in "auto-commit"?
> > never seen that.
>
> i googled for "crmsh autocommit
t practices how to set up a web server on pacemaker and DRBD"
If you don't have a *very* good reason to use a cluster file
system, for things like web servers, mail servers, file servers,
... most services actually, a "classic" file system such as xfs or
ext4 in failover configuration
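In such a failover configuration the file system is just an ordinary primitive that moves with its peers (illustrative crmsh sketch; all names and paths are made up):

  crm configure primitive fs_web ocf:heartbeat:Filesystem \
      params device=/dev/drbd0 directory=/srv/www fstype=ext4 \
      op monitor interval=20s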
lockfile (#917)
here goes:
On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > Note: ocf_take_lock is NOT actually safe to use.
> >
> > As implemented, it uses "echo $pid > lockfile"
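For comparison, a race-free way to take such a lock could look like this (illustration only, not the ocf_take_lock implementation; the path is made up):

  lockfile=/var/lock/my-agent.lock
  if ( set -o noclobber; echo "$$" > "$lockfile" ) 2>/dev/null; then
      trap 'rm -f "$lockfile"' EXIT
      # ... critical section ...
      :
  else
      echo "lock already held by PID $(cat "$lockfile" 2>/dev/null)" >&2
      exit 1
  fi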
Yay!
On Mon, May 08, 2017 at 07:50:49PM -0500, Ken Gaillot wrote:
> "crm_attribute --pattern" to update or delete all node
> attributes matching a regular expression
Just a nit, but "pattern" usually is associated with "glob pattern".
If it's not a "pattern" but a "regex",
"--regex" would be more
ver)
> * is it possible to do the opposite? persistent setting "off" and override
> it
> with the transient setting?
see above, also man crm_standby,
which again is only a wrapper around crm_attribute.
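For reference, the two lifetimes map to two independent attribute values (node name is made up; whether one overrides the other is exactly the question above):

  crm_standby --node node1 --lifetime forever --update off   # permanent value (nodes section)
  crm_standby --node node1 --lifetime reboot  --update on    # transient value (status section)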
--
: Lars Ellenberg
: LINBIT | Keeping the Di
On Tue, Apr 25, 2017 at 10:27:43AM +0200, Jehan-Guillaume de Rorthais wrote:
> On Tue, 25 Apr 2017 10:02:21 +0200
> Lars Ellenberg wrote:
>
> > On Mon, Apr 24, 2017 at 03:08:55PM -0500, Ken Gaillot wrote:
> > > Hi all,
> > >
> > > Pacemaker 1
at the time.
Though that may have been an un-intentional side-effect
of checking both sets of attributes?
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support
DRBD® and LINBIT® are register
On Mon, Apr 24, 2017 at 04:34:07PM +0200, Jehan-Guillaume de Rorthais wrote:
> Hi all,
>
> In the PostgreSQL Automatic Failover (PAF) project, one of most frequent
> negative feedback we got is how difficult it is to experience with it because
> of
> fencing occurring way too frequently. I am cur
Stop polling the cib several times per second.
If you have to, "subscribe" to cib updates, using the API.
And stop pushing that much data into the cib.
Maybe, as a stop gap, compress it yourself,
before you stuff it into the cib.
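As a sketch of that stop-gap (all names are made up; base64 keeps the blob CIB-safe):

  blob=$(gzip -9c /tmp/big-state.json | base64 -w0)
  crm_attribute --node "$(uname -n)" --name my-state --lifetime reboot --update "$blob"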
--
: Lars Ellenberg
: LINBIT | Keeping the
On Mon, Mar 06, 2017 at 12:35:18PM -0600, Ken Gaillot wrote:
> diff --git a/lrmd/lrmd.c b/lrmd/lrmd.c
> index 724edb7..39a7dd1 100644
> --- a/lrmd/lrmd.c
> +++ b/lrmd/lrmd.c
> @@ -800,11 +800,40 @@ hb2uniform_rc(const char *action, int rc, const
> char *stdout_data)
>
On Thu, Mar 02, 2017 at 05:31:33PM -0600, Ken Gaillot wrote:
> On 03/01/2017 05:28 PM, Andrew Beekhof wrote:
> > On Tue, Feb 28, 2017 at 12:06 AM, Lars Ellenberg
> > wrote:
> >> When I recently tried to make use of the DEGRADED monitoring results,
> >> I foun
When I recently tried to make use of the DEGRADED monitoring results,
I found out that it still does not work.
Because LRMD chooses to filter them in ocf2uniform_rc(),
and maps them to PCMK_OCF_UNKNOWN_ERROR.
See patch suggestion below.
It also filters away the other "special" rc values.
Do we re
get the idea.
Currently, we have SBD chosen as such a "watchdog proxy",
maybe we can generalize it?
All of that would require cooperation within the node itself, though.
In this scenario, the cluster is not trusting the "sanity"
of the "commander in chief".
So maybe i
On Tue, Sep 20, 2016 at 09:43:23AM -0500, Ken Gaillot wrote:
> On 09/20/2016 07:38 AM, Lars Ellenberg wrote:
> > From the point of view of the resource agent,
> > you configured it to use a non-existing network.
> > Which it considers to be a configuration error,
> > wh
connectivity, or dealing with removed interface drivers,
or unplugged devices, or whatnot, has to be dealt with elsewhere.
What you did is: down the bond, remove all slave assignments, even
remove the driver, and expect the resource agent to "heal" things that
it does not know about. It can not.
--
addition to the IP.
If you wanted to test-drive cluster response against a
failing network device, your test was wrong.
If you wanted to test-drive cluster response against
a "fat fingered" (or even evil) operator or admin:
give up right there...
You'll never be able to cover it all :-)
gent.
and it was considered not good enough.
That's why we provide the ocf:LINBIT:drbd resource agent.
Which you are supposed to use with DRBD 8.4 on Pacemaker.
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, C
On Wed, Aug 31, 2016 at 12:29:59PM +0200, Dejan Muhamedagic wrote:
> > Also remember that sometimes we set a "local" variable in a function
> > and expect it to be visible in nested functions, but also set a new
> > value in a nested function and expect that value to be reflected
> > in the outer s
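A small illustration of that behaviour (bash/dash dynamic scoping; "local" itself is not specified by POSIX):

  inner() {
      echo "inner: x=$x"   # sees outer's "local" x=1
      x=2                  # modifies the variable that is local to outer
  }
  outer() {
      local x=1
      inner
      echo "outer: x=$x"   # prints x=2
  }
  outer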
On Tue, Aug 30, 2016 at 06:15:49PM +0200, Dejan Muhamedagic wrote:
> On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote:
> > On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> >
> > >The kernel reads the shebang line and it is what defines the
> > >interpreter which is to be invoked to run
On Mon, Aug 29, 2016 at 04:37:00PM +0200, Dejan Muhamedagic wrote:
> Hi,
>
> On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote:
> > I think the main issue is the usage of the "local" operator in ocf*
> > I'm not an expert on this operator (never used!), don't know how hard it is
> >
nt 10 has type 'ssize_t'
> [-Werror=format=]
That's "just" a format error about ssize_t != int.
See also
https://github.com/ClusterLabs/pacemaker/commit/fc87717
where I already fixed this (and other) format errors.
Of course you could also drop the -Werror,
and hope
But don't.
If you need more than "node-dead" detection,
what you should do for a new system is:
==> use pacemaker on corosync.
Or, if all you are going to manage is a bunch of IP addresses,
maybe you should choose a different tool; VRRP with keepalived
may be better for your need
On Thu, Apr 21, 2016 at 12:50:43PM -0500, Ken Gaillot wrote:
> Hello everybody,
>
> The release cycle for 1.1.15 will be started soon (hopefully tomorrow)!
>
> The most prominent feature will be Klaus Wenninger's new implementation
> of event-driven alerts -- the ability to call scripts whenever
me:0
> TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:500
> RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
# ip addr add 192.168.7.1/24 dev tap0
# ip addr add 192.168.8.1/24 dev tap0 label tap0:jan
# ip addr show dev tap0
And as long a
On Fri, Mar 25, 2016 at 04:08:48PM +, Sam Gardner wrote:
> On 3/25/16, 10:26 AM, "Lars Ellenberg" wrote:
>
>
> >On Thu, Mar 24, 2016 at 09:01:18PM +, Sam Gardner wrote:
> >> I'm having some trouble on a few of my clusters in which the DRBD Slav
e to have some operation fail.
And you should figure out which, when, and why.
Is it the start that fails?
Why does it fail?
Cheers,
Lars
--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support
. After all,
> it does not risk the data, only the automatic cluster recovery, right?
stonith-enabled=false
means:
if some node becomes unresponsive,
it is immediately *assumed* it was "clean" dead.
no fencing takes place,
resource takeover happens without further protection.
That very mu
lab on CentOS 7.2:
> pacemaker-1.1.13-10.el7.x86_64
> corosync-2.3.4-7.el7_2.1.x86_64
>
> Since my latest "yum update", I see the following errors in my logs:
>
> Feb 13 15:22:54 san1.local crmd[1896]:error: pcmkRegisterNode: Triggered
> assert at xml.c:5
gestions. Hoping that perhaps
> someone has successfully done this.
>
> thanks in advance
> -mgb
--
: Lars Ellenberg
: http://www.LINBIT.com
to fulfill target-role, and happened to ignore
master-max, trying to promote all instances everywhere ;-)
not set: default behaviour
started: same as not set
slave: do not promote
master: nowadays for ms resources same as "Started" or not set,
but used to trigger some n
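For reference, such a target-role ends up as resource meta data, e.g. (resource id is made up):

  crm_resource --resource ms_drbd --meta --set-parameter target-role --parameter-value Slave
  crm_resource --resource ms_drbd --meta --delete-parameter target-role   # back to "not set"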
d into 1.1.13)
do not work properly with heartbeat, due to changes in pacemaker
upstream which have been incompatible with the heartbeat messaging and
membership layer (especially when it comes to fencing/stonith).
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD
drbd_make_request+0x531/0x870 [drbd]
> [] ? throtl_find_tg+0x46/0x60
> [] ? blk_throtl_bio+0x1ea/0x5f0
> [] ? blk_queue_bio+0x494/0x610
> [] ? dm_make_request+0x122/0x180 [dm_mod]
> [] generic_make_request+0x240/0x5a0
> [] ? mempool_alloc_slab+0x15/0x20
> [] ? mempool_
On Wed, Oct 07, 2015 at 05:39:01PM +0200, Lars Ellenberg wrote:
> Something like the below, maybe.
> Untested direct-to-email PoC code.
>
> if echo . | grep -q -I . 2>/dev/null; then
> have_grep_dash_I=true
> else
> have_grep_dash_I=false
> fi
> # simila
t maybe someone wants to mmap a huge file,
# and limiting the virtual size cripples mmap unnecessarily,
# so let's limit resident size instead. Let's be generous, when
# decompressing stuff that was compressed with xz -9, we may
# need ~65 MB according
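That idea would look roughly like this (illustration only; the log file variable is a placeholder, and RLIMIT_RSS is not enforced on current Linux kernels, so this documents intent more than it guarantees a cap):

  ( ulimit -m $((100 * 1024)); grep -l -I -e 'Starting Pacemaker' "$logfile" )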
> >>>>> root 27966 23.0 82.9 3248996 1594688 ?  D    12:38   0:08
> >>>>> | \_ grep -l -e Starting Pacemaker
Whoa.
grep using up 1.5 gig resident (3.2 gig virtual) still looking for
the first newline.
I suggest in addition to the (good) suggestions so far,
to a
colon-dash: X="default"
dash-only: X="default"
So, unless you happen to have OCF_RESKEY_syslog_ng_binary explicitly set
to the empty string in your environment, things work just fine.
And if you do, then that's the bug.
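To make the comparison concrete (a small shell check, not from the original post):

  unset V; echo "${V:-default}" "${V-default}"   # -> default default
  V="";    echo "${V:-default}" "${V-default}"   # -> default (and an empty string)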
Which could be worked around by:
> Yes. Interest
w it...
> I deleted the Cluster.conf file and the cib.xml and all the back up versions
> and tried again and got the same error.
> I googled this error and really got nothing. Any ideas?
--
: Lars Ellenberg
: http://www.LINBIT.com | Your Way to High Availability
: DRBD, Linux-HA an