We may get an error here if the cluster filesystem is (temporarily)
unavailable. This error stopped the whole CRM service immediately,
which in turn triggered a node reset (if it happened on the current
master), even if we still had time left to retry and thus, for
example, handle an update of pve-cluster gracefully.

Add a method which wraps the status read in an eval and logs any
error, but does not abort the service. Instead we rely on our
get_protected_ha_agent_lock method to detect a problem and switch to
the lost_agent_lock state.

The pmxcfs outage may be really short, so that the manager status
read fails but the lock update works again. To cover this case we
also always update the service status before doing real work when in
the 'active' state. If this update fails we return from the eval and
try again next round, as there is no point in doing anything without
a consistent state.

Signed-off-by: Thomas Lamprecht <t.lampre...@proxmox.com>
---
 src/PVE/HA/LRM.pm | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index 49e9f68..f076735 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -136,6 +136,21 @@ sub update_lrm_status {
     return 1;
 }
 
+sub update_service_status {
+    my ($self) = @_;
+
+    my $haenv = $self->{haenv};
+
+    my $ms = eval { $haenv->read_manager_status(); };
+    if (my $err = $@) {
+        $haenv->log('err', "updating service status from manager failed: $err");
+        return undef;
+    } else {
+        $self->{service_status} = $ms->{service_status} || {};
+        return 1;
+    }
+}
+
 sub get_protected_ha_agent_lock {
     my ($self) = @_;
 
@@ -215,8 +230,7 @@ sub do_one_iteration {
     my $status = $self->get_local_status();
     my $state = $status->{state};
 
-    my $ms = $haenv->read_manager_status();
-    $self->{service_status} = $ms->{service_status} || {};
+    $self->update_service_status();
 
     my $fence_request = PVE::HA::Tools::count_fenced_services($self->{service_status}, $haenv->nodename());
 
@@ -277,6 +291,10 @@ sub do_one_iteration {
         eval {
             # fixme: set alert timer
 
+            # if we could not get the current service status there's no point
+            # in doing anything, try again next round.
+            return if !$self->update_service_status();
+
             if ($self->{shutdown_request}) {
 
                 if ($self->{mode} eq 'restart') {
-- 
2.11.0

_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel