Please see my comments inline.
On Mon Mar 9, 2026 at 10:57 PM CET, Thomas Lamprecht wrote:
[snip]
> More test ideas welcome, but testing on real clusters would be actually
> even better ;-)
In test src/test/test-disarm-maintenance1/ with cmdlist
[
[ "power node1 on", "power node2 on", "power node3 on"],
[ "crm node3 enable-node-maintenance" ],
[ "crm node1 disarm-ha freeze" ],
[ "crm node1 arm-ha" ],
[ "crm node3 disable-node-maintenance" ]
]
the services that have been migrated away from node3 (due to the enabled
node-maintenance) are moved back to node3 upon re-arming HA and
disabling node-maintenance, which is fine.
However, if HA was disarmed with resource mode 'ignore' instead of
'freeze', the services are not moved back upon re-arming HA and
disabling node-maintenance.
It might be a good idea to:
- prohibit disarming HA with 'ignore' if at least one node is in
maintenance mode (see [0]); and
- add a test correponding to src/test/test-disarm-maintenance1/ but
with cmdlist:
[
[ "power node1 on", "power node2 on", "power node3 on"],
[ "crm node3 enable-node-maintenance" ],
[ "crm node1 disarm-ha ignore" ],
[ "crm node3 disable-node-maintenance" ],
[ "crm node1 disarm-ha ignore" ],
[ "crm node1 arm-ha" ]
]
where the expectation is that:
- the command to disarm HA with 'ignore' will be ignored;
- the CRM logs on warn-level that maintenance on node3 ought to be
disabled first; and
- after disabling node-maintenance on node3, the command to disarm HA
with 'ignore' will no longer be ignored.
>
[snip]
> +++ b/src/PVE/HA/Manager.pm
> @@ -75,6 +75,9 @@ sub new {
> # on change of active master.
> $self->{ms}->{node_request} = $old_ms->{node_request} if
> defined($old_ms->{node_request});
>
> + # preserve disarm state across CRM master changes
> + $self->{ms}->{disarm} = $old_ms->{disarm} if defined($old_ms->{disarm});
> +
> $self->update_crs_scheduler_mode(); # initial set, we update it once
> every loop
>
> return $self;
> @@ -472,7 +475,12 @@ sub update_crm_commands {
> my $node = $1;
>
> my $state = $ns->get_node_state($node);
> - if ($state eq 'online') {
> + if ($ms->{disarm}) {
> + $haenv->log(
> + 'warn',
> + "ignoring maintenance command for node $node - HA stack
> is disarmed",
> + );
> + } elsif ($state eq 'online') {
> $ms->{node_request}->{$node}->{maintenance} = 1;
> } elsif ($state eq 'maintenance') {
> $haenv->log(
> @@ -493,6 +501,25 @@ sub update_crm_commands {
> );
> }
> delete $ms->{node_request}->{$node}->{maintenance}; # gets
> flushed out at the end of the CRM loop
> + } elsif ($cmd =~ m/^disarm-ha\s+(freeze|ignore)$/) {
> + my $mode = $1;
> +
> + if ($ms->{disarm}) {
> + $haenv->log(
> + 'warn',
> + "ignoring disarm-ha command - already in disarm state
> ($ms->{disarm}->{state})",
> + );
> + } else {
> + $haenv->log('info', "got crm command: disarm-ha $mode");
[0] e.g., here adding something along the lines of:
if ($mode eq 'ignore') {
for my $node (keys %{ $ms->{node_request} }) {
if ($ms->{node_request}->{$node}->{maintenance}) {
$haenv->log(
'warn',
"ignoring disarm-ha command - "
. "disable maintenance mode on node
'$node' first",
);
return;
}
}
}
> + $ms->{disarm} = { mode => $mode, state => 'disarming' };
> + }
> + } elsif ($cmd =~ m/^arm-ha$/) {
[snip]
> +
> sub manage {
> my ($self) = @_;
>
> @@ -657,8 +765,12 @@ sub manage {
>
> # compute new service status
>
> + # skip service add/remove when disarmed - handle_disarm manages service
> status
> + my $is_disarmed = $ms->{disarm};
> +
> # add new service
> foreach my $sid (sort keys %$sc) {
> + next if $is_disarmed;
> next if $ss->{$sid}; # already there
> my $cd = $sc->{$sid};
> next if $cd->{state} eq 'ignored';
> @@ -675,6 +787,7 @@ sub manage {
>
> # remove stale or ignored services from manager state
> foreach my $sid (keys %$ss) {
> + next if $is_disarmed;
> next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
>
> my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no
> config';
nit: Instead of checking for each $sid whether $is_disarmed, a single
guard could be used to skip service add/remove when disarmed, e.g.:
# skip service add/remove when disarmed - handle_disarm manages service
status
if (!$ms->{disarm}) {
foreach my $sid (sort keys %$sc) {
next if $ss->{$sid}; # already there
my $cd = $sc->{$sid};
next if $cd->{state} eq 'ignored';
# ...
}
# remove stale or ignored services from manager state
foreach my $sid (keys %$ss) {
next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
my $reason = defined($sc->{$sid}) ? 'ignored state requested' :
'no config';
# ...
}
}
> @@ -713,6 +826,15 @@ sub manage {
[snip]
> diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
> index 8cbf48d..b4f1d88 100644
> --- a/src/PVE/HA/Sim/Hardware.pm
> +++ b/src/PVE/HA/Sim/Hardware.pm
> @@ -835,6 +835,10 @@ sub sim_hardware_cmd {
> || $action eq 'disable-node-maintenance'
> ) {
> $self->queue_crm_commands_nolock("$action $node");
> + } elsif ($action eq 'disarm-ha') {
> + $self->queue_crm_commands_nolock("disarm-ha $params[0]");
nit: to increase readability and usability
my $mode = $params[0];
die "sim_hardware_cmd: unknown resource mode '$mode'"
if $mode !~ m/^(freeze|ignore)$/;
$self->queue_crm_commands_nolock("disarm-ha $mode");
> + } elsif ($action eq 'arm-ha') {
> + $self->queue_crm_commands_nolock("arm-ha");
> } else {
> die "sim_hardware_cmd: unknown action '$action'";
> }
[snip]