Please see my comments inline.

On Mon Mar 9, 2026 at 10:57 PM CET, Thomas Lamprecht wrote:

[snip]

> More test ideas welcome, but testing on real clusters would be actually
> even better ;-)

In test src/test/test-disarm-maintenance1/ with cmdlist

    [
        [ "power node1 on", "power node2 on", "power node3 on"],
        [ "crm node3 enable-node-maintenance" ],
        [ "crm node1 disarm-ha freeze" ],
        [ "crm node1 arm-ha" ],
        [ "crm node3 disable-node-maintenance" ]
    ]

the services that have been migrated away from node3 (due to the enabled
node-maintenance) are moved back to node3 upon re-arming HA and
disabling node-maintenance, which is fine.

However, if HA was disarmed with resource mode 'ignore' instead of
'freeze', the services are not moved back upon re-arming HA and
disabling node-maintenance. 

It might be a good idea to:

  - prohibit disarming HA with 'ignore' if at least one node is in
    maintenance mode (see [0]); and
  - add a test correponding to src/test/test-disarm-maintenance1/ but
    with cmdlist:

    [
        [ "power node1 on", "power node2 on", "power node3 on"],
        [ "crm node3 enable-node-maintenance" ],
        [ "crm node1 disarm-ha ignore" ],
        [ "crm node3 disable-node-maintenance" ],
        [ "crm node1 disarm-ha ignore" ],
        [ "crm node1 arm-ha" ]
    ]

where the expectation is that: 

  - the command to disarm HA with 'ignore' will be ignored; 
  - the CRM logs on warn-level that maintenance on node3 ought to be
    disabled first; and
  - after disabling node-maintenance on node3, the command to disarm HA
    with 'ignore' will no longer be ignored.

>

[snip]

> +++ b/src/PVE/HA/Manager.pm
> @@ -75,6 +75,9 @@ sub new {
>      # on change of active master.
>      $self->{ms}->{node_request} = $old_ms->{node_request} if 
> defined($old_ms->{node_request});
>  
> +    # preserve disarm state across CRM master changes
> +    $self->{ms}->{disarm} = $old_ms->{disarm} if defined($old_ms->{disarm});
> +
>      $self->update_crs_scheduler_mode(); # initial set, we update it once 
> every loop
>  
>      return $self;
> @@ -472,7 +475,12 @@ sub update_crm_commands {
>              my $node = $1;
>  
>              my $state = $ns->get_node_state($node);
> -            if ($state eq 'online') {
> +            if ($ms->{disarm}) {
> +                $haenv->log(
> +                    'warn',
> +                    "ignoring maintenance command for node $node - HA stack 
> is disarmed",
> +                );
> +            } elsif ($state eq 'online') {
>                  $ms->{node_request}->{$node}->{maintenance} = 1;
>              } elsif ($state eq 'maintenance') {
>                  $haenv->log(
> @@ -493,6 +501,25 @@ sub update_crm_commands {
>                  );
>              }
>              delete $ms->{node_request}->{$node}->{maintenance}; # gets 
> flushed out at the end of the CRM loop
> +        } elsif ($cmd =~ m/^disarm-ha\s+(freeze|ignore)$/) {
> +            my $mode = $1;
> +
> +            if ($ms->{disarm}) {
> +                $haenv->log(
> +                    'warn',
> +                    "ignoring disarm-ha command - already in disarm state 
> ($ms->{disarm}->{state})",
> +                );
> +            } else {
> +                $haenv->log('info', "got crm command: disarm-ha $mode");

[0] e.g., here adding something along the lines of:

                   if ($mode eq 'ignore') {
                       for my $node (keys %{ $ms->{node_request} }) {
                           if ($ms->{node_request}->{$node}->{maintenance}) {
                               $haenv->log(
                                   'warn',
                                   "ignoring disarm-ha command - "
                                       . "disable maintenance mode on node 
'$node' first",
                               );
                               return;
                           }
                       }
                   }

> +                $ms->{disarm} = { mode => $mode, state => 'disarming' };
> +            }
> +        } elsif ($cmd =~ m/^arm-ha$/) {

[snip]

> +
>  sub manage {
>      my ($self) = @_;
>  
> @@ -657,8 +765,12 @@ sub manage {
>  
>      # compute new service status
>  
> +    # skip service add/remove when disarmed - handle_disarm manages service 
> status
> +    my $is_disarmed = $ms->{disarm};
> +
>      # add new service
>      foreach my $sid (sort keys %$sc) {
> +        next if $is_disarmed;
>          next if $ss->{$sid}; # already there
>          my $cd = $sc->{$sid};
>          next if $cd->{state} eq 'ignored';
> @@ -675,6 +787,7 @@ sub manage {
>  
>      # remove stale or ignored services from manager state
>      foreach my $sid (keys %$ss) {
> +        next if $is_disarmed;
>          next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
>  
>          my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 'no 
> config';

nit: Instead of checking for each $sid whether $is_disarmed, a single
guard could be used to skip service add/remove when disarmed, e.g.:

       # skip service add/remove when disarmed - handle_disarm manages service 
status
       if (!$ms->{disarm}) {
          foreach my $sid (sort keys %$sc) {
              next if $ss->{$sid}; # already there
              my $cd = $sc->{$sid};
              next if $cd->{state} eq 'ignored';
              # ...
          }
      
          # remove stale or ignored services from manager state
          foreach my $sid (keys %$ss) {
              next if $sc->{$sid} && $sc->{$sid}->{state} ne 'ignored';
      
              my $reason = defined($sc->{$sid}) ? 'ignored state requested' : 
'no config';
              # ...
          }
       }

> @@ -713,6 +826,15 @@ sub manage {

[snip]

> diff --git a/src/PVE/HA/Sim/Hardware.pm b/src/PVE/HA/Sim/Hardware.pm
> index 8cbf48d..b4f1d88 100644
> --- a/src/PVE/HA/Sim/Hardware.pm
> +++ b/src/PVE/HA/Sim/Hardware.pm
> @@ -835,6 +835,10 @@ sub sim_hardware_cmd {
>                  || $action eq 'disable-node-maintenance'
>              ) {
>                  $self->queue_crm_commands_nolock("$action $node");
> +            } elsif ($action eq 'disarm-ha') {
> +                $self->queue_crm_commands_nolock("disarm-ha $params[0]");

nit: to increase readability and usability

                   my $mode = $params[0];

                   die "sim_hardware_cmd: unknown resource mode '$mode'"
                       if $mode !~ m/^(freeze|ignore)$/;

                   $self->queue_crm_commands_nolock("disarm-ha $mode");

> +            } elsif ($action eq 'arm-ha') {
> +                $self->queue_crm_commands_nolock("arm-ha");
>              } else {
>                  die "sim_hardware_cmd: unknown action '$action'";
>              }

[snip]



Reply via email to