applied with fixes - see comment inline:

> On April 13, 2018 at 12:24 PM Wolfgang Link <w.l...@proxmox.com> wrote:
> 
> 
> In certain high-load scenarios ANY ZFS operation can block,
> including registering an (async) destroy.
> Since ZFS operations are implemented via ioctl's,
> killing the user space process
> does not affect the waiting kernel thread processing the ioctl.
> 
> Once "zfs destroy" has been called, killing it does not say anything
> about whether the destroy operation will be aborted or not.
> Since running into a timeout effectively means killing it,
> we don't know whether the snapshot exists afterwards or not.
> We also don't know how long it takes for ZFS to catch up on pending ioctls.
> 
> Given the above problem, we must to not die on errors when deleting a no
> longer needed snapshot fails (under a timeout) after an otherwise
> successful replication. Since we retry on the next run anyway, this is
> not problematic.
> 
> The snapshot deletion error will be logged in the replication log
> and the syslog/journal.
> ---
>  PVE/Replication.pm | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> Patch V2:
> As discuss with Dietmar we will keep remove the former snapshot at the end of
> the run,
> but ignore if it fails. This is to prevent we have 2 replication snapshots at
> time.
> 
> Patch V3:
> Remove cleanup_local_snapshots from eval as Diemar suggest.
> Use Fabian commit message.
> 
> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> index 9bc4e61..98ba1b6 100644
> --- a/PVE/Replication.pm
> +++ b/PVE/Replication.pm
> @@ -136,8 +136,21 @@ sub prepare {
>               $last_snapshots->{$volid}->{$snap} = 1;
>           } elsif ($snap =~ m/^\Q$prefix\E/) {
>               $logfunc->("delete stale replication snapshot '$snap' on 
> $volid");
> -             PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> -             $cleaned_replicated_volumes->{$volid} = 1;
> +
> +             eval {
> +                 PVE::Storage::volume_snapshot_delete($storecfg, $volid, 
> $snap);
> +                 $cleaned_replicated_volumes->{$volid} = 1;
> +             };
> +
> +             # If deleting the snapshot fails, we can not be sure if it was 
> due to an
> error or a timeout.
> +             # The likelihood that the delete has worked out is high at a 
> timeout.
> +             # If it really fails, it will try to remove on the next run.
> +
> +             # warn is for syslog/journal.
> +             warn $@ if $@;
> +
> +             # logfunc will written in replication log.
> +             $logfunc->("delete stale replication snapshot error: $@") if $@;
>           }
>       }
>      }
> @@ -296,9 +309,18 @@ sub replicate {
>      # remove old snapshots because they are no longer needed
>      $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
>  
> -    remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids,
> $start_time, $logfunc);
> +    eval {
> +     remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids,
> $start_time, $logfunc);
> +    };
>  
> -    die $err if $err;
> +    # old snapshots will removed by next run from prepare_local_job.
> +    if ($err = $@) {
> +     # warn is for syslog/journal.
> +     warn $err;
> +
> +     # logfunc will written in replication log.
> +     $logfunc->("delete stale replication snapshot error: err");

should be:

        $logfunc->("delete stale replication snapshot error: $err");

> +    }
>  
>      return $volumes;
>  }
> -- 

_______________________________________________
pve-devel mailing list
pve-devel@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Reply via email to