We are only allowed to execute any command once, as otherwise we may disturb
the synchrony between the CRM and the LRM.
The following scenario could happen:

schedule CRM: deploy task 'migrate' for service vm:100 with UID 1234
schedule LRM: fork task with UID 1234
schedule CRM: idle, as no result is available yet
schedule LRM: collect finished task UID and write result for CRM
              (the result was an error)
schedule LRM: fork task with UID 1234 _AGAIN_
schedule CRM: process the _previous_ erroneous result from the LRM and
              place vm:100 in started state on the source node
schedule LRM: collect finished task UID and write result for CRM
              (this time the task was successful and the service is on the
              target node, but the CRM knows _nothing_ of it!)
schedule LRM: try to schedule the task, but the service is not on this
              node! => error

To fix that we _never_ execute two exactly identical migrate commands
directly after each other; "exactly identical" means here that the UID of
the command to queue matches the UID of the last _finished_ command. This
enforces the originally intended SYN-ACK behaviour between the CRM and the
LRMs.

We now generate a new UID for services which do not change state when one
of the following evaluates to true:

* disabled AND in stopped state
* enabled AND in started state

This ensures that the state from the CRM holds in the LRM and thus, for
example, a killed VM gets restarted.

Signed-off-by: Thomas Lamprecht <t.lampre...@proxmox.com>
---
changes since v3:
* do not restart any task with the same UID, not only the migrate/relocate
  ones
* with this, the behaviour of the error state also changes, so we only log
  once when a service gets placed in the error state
* remove the logging, as it only spams the log without any use
* the new UID computation, described in the commit message
* added a return after the change to the error state in
  'next_state_started' to avoid possible false state changes out of the
  error state when select_service_node returns another node

 src/PVE/HA/LRM.pm                          | 13 +++++++++++++
 src/PVE/HA/Manager.pm                      | 17 +++++++++++------
 src/test/test-resource-failure5/log.expect |  2 --
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/src/PVE/HA/LRM.pm b/src/PVE/HA/LRM.pm
index f53f26d..f0e7977 100644
--- a/src/PVE/HA/LRM.pm
+++ b/src/PVE/HA/LRM.pm
@@ -31,6 +31,8 @@ sub new {
         workers => {},
         results => {},
         restart_tries => {},
+        # store finished job UIDs so we don't exec the same command twice
+        processed_uids => {},
         shutdown_request => 0,
         shutdown_errors => 0,
         # mode can be: active, reboot, shutdown, restart
@@ -413,6 +415,7 @@ sub run_workers {
                $haenv->log('err', $err);
            }
            if (defined($w->{uid})) {
+               $self->{processed_uids}->{$sid} = $w->{uid};
                $self->resource_command_finished($sid, $w->{uid}, $res);
            } else {
                $self->stop_command_finished($sid, $res);
@@ -455,6 +458,15 @@ sub manage_resources {
 sub queue_resource_command {
     my ($self, $sid, $uid, $state, $target) = @_;
 
+    my $haenv = $self->{haenv};
+    my $processed = $self->{processed_uids};
+
+    # do not queue the exact same command twice as this may lead to
+    # an inconsistent HA state when the first command fails but the CRM
+    # does not process its failure right away and the LRM starts a second
+    # try, without the CRM knowing of it (race condition)
+    return if ($uid && $processed->{$sid} && $uid eq $processed->{$sid});
+
     if (my $w = $self->{workers}->{$sid}) {
        return if $w->{pid}; # already started
        # else, delete and overwrite queue entry with new command
@@ -482,6 +494,7 @@ sub check_active_workers {
        my $waitpid = waitpid($pid, WNOHANG);
        if (defined($waitpid) && ($waitpid == $pid)) {
            if (defined($w->{uid})) {
+               $self->{processed_uids}->{$sid} = $w->{uid};
                $self->resource_command_finished($sid, $w->{uid}, $?);
            } else {
                $self->stop_command_finished($sid, $?);
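For illustration only, not part of the patch: a minimal standalone Perl
sketch of what the LRM-side guard above achieves. The helpers queue_command
and command_finished are made up for this example and only roughly stand in
for queue_resource_command and the worker-collection paths touched above.

#!/usr/bin/perl
# Standalone sketch: a command is only queued if its UID differs from the
# UID of the last finished command, so a failed command cannot be executed
# a second time before the CRM has acknowledged the failure.
use strict;
use warnings;

my %processed_uids;    # $sid => UID of the last finished command

sub queue_command {
    my ($sid, $uid) = @_;

    # same UID as the last finished command => wait for the CRM to react
    return 0 if defined($processed_uids{$sid}) && $processed_uids{$sid} eq $uid;

    print "queue '$sid' with UID $uid\n";
    return 1;
}

sub command_finished {
    my ($sid, $uid) = @_;
    $processed_uids{$sid} = $uid;    # a retry now needs a fresh UID from the CRM
}

queue_command('vm:100', 1234);       # queued and executed (fails)
command_finished('vm:100', 1234);    # result written for the CRM
queue_command('vm:100', 1234);       # skipped - no second try with the same UID
queue_command('vm:100', 1235);       # new UID handed out by the CRM => runs again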
diff --git a/src/PVE/HA/Manager.pm b/src/PVE/HA/Manager.pm
index d0031e7..2e5194f 100644
--- a/src/PVE/HA/Manager.pm
+++ b/src/PVE/HA/Manager.pm
@@ -517,12 +517,14 @@ sub next_state_stopped {
        } else {
            $haenv->log('err', "unknown command '$cmd' for service '$sid'");
        }
-    } 
+    }
 
     if ($cd->{state} eq 'disabled') {
-        # do nothing
+        # tell the LRM we are in a synced state
+        # ensure that it stays stopped
+        $sd->{uid} = compute_new_uuid($sd->{state});
         return;
-    } 
+    }
 
     if ($cd->{state} eq 'enabled') {
         # simply mark it started, if it's on the wrong node
@@ -582,7 +584,8 @@ sub next_state_started {
 
            } elsif ($ec == ETRY_AGAIN) {
 
-               # do nothing, the LRM wants to try again
+               # tell the LRM we got the result and that it can try again
+               $sd->{uid} = compute_new_uuid($sd->{state});
 
            } elsif ($ec == ERROR) {
 
                # apply our relocate policy if we got ERROR from the LRM
@@ -612,6 +615,7 @@ sub next_state_started {
                         " (exit code $ec))");
                # we have no save way out (yet) for other errors
                &$change_service_state($self, $sid, 'error');
+               return;
            }
        }
 
@@ -627,12 +631,13 @@ sub next_state_started {
                    &$change_service_state($self, $sid, 'relocate', node => $sd->{node}, target => $node);
                }
            } else {
-               # do nothing
+               # ensure the service gets started again if it went down unexpectedly
+               $sd->{uid} = compute_new_uuid($sd->{state});
            }
        }
 
        return;
-    } 
+    }
 
     $haenv->log('err', "service '$sid' - unknown state '$cd->{state}' in service configuration");
 }
diff --git a/src/test/test-resource-failure5/log.expect b/src/test/test-resource-failure5/log.expect
index b6e7807..283ca8c 100644
--- a/src/test/test-resource-failure5/log.expect
+++ b/src/test/test-resource-failure5/log.expect
@@ -31,8 +31,6 @@ err 143 node2/lrm: unable to start service fa:130 on local node after 1 r
 err 160 node1/crm: recovery policy for service fa:130 failed, entering error state!
 info 160 node1/crm: service 'fa:130': state changed from 'started' to 'error'
 warn 163 node2/lrm: service fa:130 is not running and in an error state
-warn 183 node2/lrm: service fa:130 is not running and in an error state
-warn 203 node2/lrm: service fa:130 is not running and in an error state
 info 220 cmdlist: execute service fa:130 disabled
 info 220 node1/crm: service 'fa:130': state changed from 'error' to 'stopped'
 info 820 hardware: exit simulation - done
-- 
2.1.4

_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
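Again purely for illustration and not part of the patch: a minimal Perl
sketch of the CRM-side idea above, namely that a service which keeps its
wanted state still gets a fresh UID, so the LRM re-executes the idempotent
command and, for example, restarts a killed VM. The body of
compute_new_uuid below is only a stand-in for the helper referenced in
Manager.pm.

#!/usr/bin/perl
# Standalone sketch: hand out a fresh UID even when the service state does
# not change, so the LRM keeps enforcing that state on its next round.
use strict;
use warnings;
use Digest::MD5 qw(md5_base64);

# stand-in for the compute_new_uuid() helper referenced in Manager.pm
sub compute_new_uuid {
    my ($state) = @_;
    return md5_base64($state . $$ . time() . rand());
}

my $cd = { state => 'enabled' };                                      # service config
my $sd = { state => 'started', uid => compute_new_uuid('started') };  # manager state

# enabled AND started: formerly "do nothing", now refresh the UID so the
# LRM starts the service again if the VM was killed in the meantime
if ($cd->{state} eq 'enabled' && $sd->{state} eq 'started') {
    $sd->{uid} = compute_new_uuid($sd->{state});
}

print "new UID for started service: $sd->{uid}\n";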