Dean Cording wrote:
I've come across an issue with the way that rdiff-backup ensures that only one server is accessing a backup dataset.
...
Recently I had a backup fail, probably because of a network outage. All subsequent backups refuse to run because rdiff-backup believes the failed rdiff-backup instance is still running - even though this is clearly impossible because it is a totally different instance of the virtual server.

This had me stumped for a while but I finally figured out what is happening.

Because I start a new virtual server instance each time and I run the backup from a script, everything happens in a consistent order. As a result, the rdiff-backup process on the server almost always gets the same PID for each backup session. So when a backup fails, the subsequent backup looks at the metadata, finds the PID of the failed backup, and sees that that PID is still running - not realising that the "other" instance is actually itself.

A cursory look at regress.py seems to confirm this behavior; specifically, check_pids() has:

    if pid is not None and pid_running(pid):

This could say:

    if pid is not None and pid != os.getpid() and pid_running(pid):
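
For what it's worth, here is a minimal sketch of what that guard amounts to, assuming pid_running() uses the usual signal-0 trick - I haven't checked how rdiff-backup actually implements it, and previous_session_still_running() is just an illustrative name, not anything in regress.py:

    import errno
    import os

    def pid_running(pid):
        """Illustrative liveness check: signal 0 probes for the
        process's existence without actually signalling it."""
        try:
            os.kill(pid, 0)
        except OSError as e:
            # ESRCH: no such process; EPERM: it exists but belongs
            # to another user, so it is still "running".
            return e.errno == errno.EPERM
        return True

    def previous_session_still_running(pid):
        """The recorded PID only matters if it names some *other*
        live process - our own PID proves nothing."""
        return pid is not None and pid != os.getpid() and pid_running(pid)

Comparing against os.getpid() first means a recycled PID that happens to be the current process is never treated as a conflicting session.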


I'm not sure how to work around this problem, as the virtual machine always starts from a known state and hasn't been running long enough to build up the entropy needed to generate random numbers that differ between sessions.

The current time does add a little randomness, though. A silly workaround would be to call the following Perl script before running rdiff-backup:

#!/usr/bin/perl
# Spawn /bin/true between 1 and 100 times so the kernel's PID counter
# advances by a random amount before rdiff-backup starts.
`/bin/true` for 0..int(rand(100));

This will advance the PID counter and should stop your job from failing repeatedly.
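
If you'd rather not add a Perl dependency, the same trick can be sketched in Python from your backup wrapper - this is just an equivalent of the script above, not anything rdiff-backup itself provides:

    import random
    import subprocess

    # Burn a random number of PIDs so the next process started
    # (rdiff-backup) is unlikely to land on the PID recorded by
    # the failed session.
    for _ in range(random.randint(1, 100)):
        subprocess.call(["/bin/true"])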

Steven


