[pve-devel] [PATCH v3 guest-common 6/7] partially fix #3111: replication: be less picky when selecting incremental base

Fabian Ebner Thu, 12 Aug 2021 04:02:35 -0700

After rollback, it might be necessary to start the replication from an
earlier, possibly non-replication, snapshot, because the replication
snapshot might have been removed from the source node. Previously,
replication could only recover in case the current parent snapshot was
already replicated.


To get into the bad situation (with no replication happening between
the steps):
1. have existing replication
2. take new snapshot
3. rollback to that snapshot
In case the partial fix to only remove blocking replication snapshots
for rollback was already applied, an additional step is necessary to
get into the bad situation:
4. take a second new snapshot

Since non-replication snapshots are now also included, where no
timestamp is readily available, it is necessary to filter them out
when probing for replication snapshots.

If no common replication snapshot is present, iterate backwards
through the config snapshots.

The changes are backwards compatible:

If one side is running the old code, and the other the new code,
the fact that one of the two prepare() calls does not return the
new additional snapshot candidates, means that no new match is
possible. Since the new prepare() returns a superset, no previously
possible match is now impossible.

The branch with @desc_sorted_snap is now taken more often, but
it can still only be taken when the volume exists on the remote side
(and has snapshots). In such cases, it is safe to die if no
incremental base snapshot can be found, because a full sync would not
be possible.

Signed-off-by: Fabian Ebner <f.eb...@proxmox.com>
---

$parent_snapname could now slowly be dropped from prepare(), but I'll
save that for later (it'll take at least until 8.x because of
backwards compatibility anyways).

 src/PVE/Replication.pm | 31 +++++++++++++++++++++++++++++--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/PVE/Replication.pm b/src/PVE/Replication.pm
index 4056ea2..2609ad6 100644
--- a/src/PVE/Replication.pm
+++ b/src/PVE/Replication.pm
@@ -65,10 +65,12 @@ sub find_common_replication_snapshot {
                     ($last_snapshots->{$volid}->{$parent_snapname} &&
                      $remote_snapshots->{$volid}->{$parent_snapname})) {
                $base_snapshots->{$volid} = $parent_snapname;
-           } elsif ($last_sync == 0) {
+           } else {
+               # First, try all replication snapshots.
                my @desc_sorted_snap =
                    map { $_->[1] } sort { $b->[0] <=> $a->[0] }
-                   map { [ ($_ =~ /__replicate_\Q$jobid\E_(\d+)_/)[0] || 0, $_ 
] }
+                   grep { $_->[0] != 0 } # only consider replication snapshots
+                   map { [ ($_ =~ /__replicate_\Q$vmid\E-(?:\d+)_(\d+)_/)[0] 
|| 0, $_ ] }
                    keys %{$remote_snapshots->{$volid}};
 
                foreach my $remote_snap (@desc_sorted_snap) {
@@ -77,6 +79,28 @@ sub find_common_replication_snapshot {
                        last;
                    }
                }
+
+               # Then, try config snapshots ($parent_snapname was already 
tested for above).
+               my $snapname = $parent_snapname // '';
+
+               # Be robust against loop, just in case.
+               # 
https://forum.proxmox.com/threads/snapshot-not-working.69511/post-312281
+               my $max_count = scalar(keys $guest_conf->{snapshots}->%*);
+               for (my $count = 0; $count < $max_count; $count++) {
+                   last if defined($base_snapshots->{$volid});
+
+                   my $snap_conf = $guest_conf->{snapshots}->{$snapname} || 
last;
+                   $snapname = $snap_conf->{parent} // last;
+
+                   if ($last_snapshots->{$volid}->{$snapname} &&
+                       $remote_snapshots->{$volid}->{$snapname})
+                   {
+                       $base_snapshots->{$volid} = $snapname;
+                   }
+               }
+
+               # The volume exists on the remote side, so trying a full sync 
won't work.
+               # Die early with a clean error.
                die "No common base to restore the job state\n".
                    "please delete jobid: $jobid and create the job again\n"
                    if !defined($base_snapshots->{$volid});
@@ -182,6 +206,9 @@ sub prepare {
                } else {
                    $last_snapshots->{$volid}->{$snap} = 1;
                }
+           # Other snapshots might need to serve as replication base after 
rollback
+           } else {
+               $last_snapshots->{$volid}->{$snap} = 1;
            }
        }
     }
-- 
2.30.2



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

[pve-devel] [PATCH v3 guest-common 6/7] partially fix #3111: replication: be less picky when selecting incremental base

Reply via email to