Tested this by recreating the scenario as described, forcing a longer migration time for the second VM by giving it a much larger hard disk. Without the patches applied, vm:100 was migrated back to the original node while vm:101 was still being migrated. With the patches applied, this no longer happens: vm:100 is migrated to the target node and remains there until vm:101's migration has finished. In general, migrations according to HA rules still seem to work as expected. The code changes also look good to me, thanks for adding the new tests!
Consider this:

Tested-by: Michael Köppl <[email protected]>
Reviewed-by: Michael Köppl <[email protected]>

On Mon Nov 3, 2025 at 4:17 PM CET, Daniel Kral wrote:
> NOTE: This fix is based on top of [1], which itself is based on [0].
>
> This fixes an accounting bug, where HA resources in positive affinity
> are migrated/relocated back to the (alphabetically-first) source node,
> because both the source and target node are considered when evaluating
> where a HA resource should be in `select_service_node`.
>
> Example: vm:100 and vm:101 are in a positive resource affinity rule.
>
> 1. vm:100 is migrated from node1 to node3
> 2. vm:101 will also be migrated from node1 to node3 at the same time
> 3. vm:100 finishes migration at least 10 seconds before vm:101
> 4. vm:100 checks for a better node placement
> 4a. vm:100 checks whether the positive resource affinity is held and
>     will get the information that the other HA resource (just vm:101)
>     is on both node1 and node3
> 4b. In case of equal weights on both nodes, the alphabetically first is
>     chosen [0]
> 5. vm:100 is migrated to node1
>
> This fix needs changes from [0] as this patch series implements a way to
> differentiate between $current_node and $target_node in
> get_resource_affinity(...). Since [1] makes changes to that subroutine
> too, I rebased on top of [1], even though this fix can also be applied
> on top of [0] with some adaptation.
>
> I tried to write the test case a little more straightforwardly by having
> a parameter to set a 'migration duration', but that would require quite
> a few modifications to the current single-threaded pve-ha-tester, e.g. a
> waitqueue which handles "delayed" migration finishes. We could still do
> that if we need it for some other test case, but for now setting up the
> environment worked fine.
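For anyone following along, the tie-break described in steps 4a/4b can be sketched roughly as below. This is a minimal Python illustration, not the actual Perl code from pve-ha-manager; the `peers` structure, the `current`/`target` fields, and `pick_node()` are made up for the example.

```python
def pick_node(peers, count_both):
    """Count positive-affinity peers per node and pick the best node.

    With count_both=True (the pre-fix behaviour), a still-migrating peer
    is counted on both its current and its target node; on a tie, the
    alphabetically-first node wins, which is how vm:100 ends up being
    moved back to node1. With count_both=False, only the target node of
    the migrating peer is considered.
    """
    counts = {}
    for peer in peers:
        nodes = {peer["target"]}
        if count_both:
            nodes.add(peer["current"])
        for node in nodes:
            counts[node] = counts.get(node, 0) + 1
    # highest peer count wins; on a tie, max() over the sorted node
    # names returns the alphabetically-first node
    return max(sorted(counts), key=lambda n: counts[n])

# vm:101 is still migrating node1 -> node3 when vm:100 re-evaluates
peers = [{"current": "node1", "target": "node3"}]
print(pick_node(peers, count_both=True))   # pre-fix: tie, picks "node1"
print(pick_node(peers, count_both=False))  # fixed: picks "node3"
```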
>
> [0] https://lore.proxmox.com/pve-devel/[email protected]/
> [1] https://lore.proxmox.com/pve-devel/[email protected]/
>
> Daniel Kral (2):
>   test: add delayed positive resource affinity migration test case
>   fix #6801: only consider target node during positive resource affinity
>     migration
>
>  src/PVE/HA/Rules/ResourceAffinity.pm |  6 ++--
>  .../log.expect                       | 25 +++-------------
>  .../log.expect                       | 28 +++++++++----------
>  .../README                           |  2 ++
>  .../cmdlist                          |  3 ++
>  .../hardware_status                  |  5 ++++
>  .../log.expect                       | 26 +++++++++++++++++
>  .../manager_status                   | 21 ++++++++++++++
>  .../rules_config                     |  3 ++
>  .../service_config                   |  4 +++
>  10 files changed, 86 insertions(+), 37 deletions(-)
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/README
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/cmdlist
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/hardware_status
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/log.expect
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/manager_status
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/rules_config
>  create mode 100644 src/test/test-resource-affinity-strict-positive6/service_config

_______________________________________________
pve-devel mailing list
[email protected]
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
