Jan Rafaj [[EMAIL PROTECTED]] writes:

> How about adding a feature to keep the checksums in a berkeley-style
> database somewhere on the HDD separately, and with subsequent
> mirroring attempts, look to it just for the checksums, so that
> the rsync does not need to do checksumming of whole target
> (already mirrored) file tree ?

There's a chicken and egg issue with this - how do you know that the separately stored checksum accurately reflects the file which it represents? Once they are stored separately they can get out of sync. The natural way to verify the checksum would be to recompute it, but then you're sort of back to square one. I know there have been discussions about this sort of thing on the list in the past.

For multiple similar distributions, the rsync+ work (recently incorporated into the mainline rsync in experimental mode - the write-batch and read-batch options) helps remove repeated computations of the checksums and deltas, but it's not a generalized system for any random transfer.

I've wanted similar benefits because we use dialup to remote locations and for databases with hundreds of MB or 1-2 GB, we end up wasting a bit of phone time when both sides are just computing checksums. But I'm not sure of a good generalized solution. There may be platform specific hacks (e.g., under NT, storing the computed checksum in a separate stream in the file, so it's guaranteed to be associated with the file), but I don't know of a portable way to link meta information with filesystem files.

Note that if you aren't already, be sure that you up the default blocksize for large files - that can cut down significantly on both checksum computation time as well as meta data transferred over the session, since there are fewer blocks that need two checksums (weak + MD4) apiece.

> - make output of error & status messages from rsync uniformed,
>   so that it could be easily parsed by scripts (it is not right
>   now - rsync 2.5.5)

I know Martin has expressed some interest to the list in having something like this in the future as an option.

> - perhaps if the network connection between rsync client and server
>   stalls for some reason, implement something like 'tcp keepalive'
>   feature ?

I think rsync is pretty complicated at the network level already - it seems reasonable to me that rsync ought to be able to assume that the lowest level network protocol stack will get the data to the other end and/or give an error if something goes wrong without needing a lot of babysitting.

In all but the rsync server cases, rsync doesn't control the network stream itself anyway (it just has a child process using ssh, rsh or anything else), so it becomes a question for that particular utility and not something rsync can do anything about. In the rsync server case, it already sets the TCP keepalive option (SO_KEEPALIVE) at the socket level when it receives a connection. If your network transport between systems is problematic, there's a limited amount of stuff rsync can do about it.

Oh and no, just being idle on a session shouldn't terminate it, no matter how long rsync takes to compute checksums. So if that's happening to you, you might want to investigate your network connectivity. Or perhaps you're going through a NAT or some sort of proxy box that places a timeout on TCP sessions that you can increase?

Upon failures, if you use --partial and a separate destination directory you can keep re-trying and slowly get the whole file across (that's how we do our backups) but you do still need to recompute checksums each time. It might be nice to see if rsync itself could have a retry mechanism that would re-use the existing checksum information it had computed previously. I have a feeling with the structure of the code at this point though that doing so would be reasonably complicated.

The caveat to --partial is that once you have a partial file, even with --compare-dest, that partial file is all rsync considers for the remaining portion of the transfer.
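
For what it's worth, the re-try approach with --partial could be wrapped in a small script along the following lines. This is only a rough sketch, not how our backups are actually driven: the function names, the paths, the attempt count and the delay are all made up for illustration. Only the rsync options --partial and --compare-dest come from the discussion above. Each attempt still recomputes checksums from scratch.

```python
import subprocess
import time


def build_rsync_command(source, dest_dir, previous_dir):
    """Build an rsync command line (hypothetical helper).

    --partial keeps an interrupted file instead of deleting it;
    --compare-dest points rsync at the directory holding the prior copy.
    """
    return [
        "rsync",
        "--partial",
        "--compare-dest=" + previous_dir,
        source,
        dest_dir,
    ]


def rsync_once(command):
    """Run a single rsync attempt and return its exit status."""
    return subprocess.run(command).returncode


def run_with_retries(run_once, max_attempts=5, delay_seconds=60.0,
                     sleep=time.sleep):
    """Call run_once() until it returns 0 or the attempts run out.

    Returns the number of attempts used on success, or None if every
    attempt failed. The defaults are assumptions, not recommendations.
    """
    for attempt in range(1, max_attempts + 1):
        if run_once() == 0:
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before redialing / reconnecting
    return None


# Example use (hypothetical host and paths):
#   cmd = build_rsync_command("remote:/db/data.db", "/backup/new/",
#                             "/backup/prev/")
#   run_with_retries(lambda: rsync_once(cmd))
```

The runner is passed in as a callable so the retry logic can be exercised without a real rsync or network connection.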

So originally for our database backups, I was removing any partial copy manually if it was less than some fraction of the previous copy I already had, since I'd lose less time rebuilding that fraction than losing access to the entire prior file.

In response to that, there was another internal-use patch I made to rsync to "--partial-pad" any partial file with data from the original file on the destination system during an error. No guarantees it would work as well, since I just took data from the original file past the size point of the partial copy, but in many cases (growing files) it's a big win. If anyone is interested, I could extract it and post it.

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: [EMAIL PROTECTED]  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/
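
To make the two ideas above a bit more concrete (discarding a small partial copy, and padding a partial file from the original), here is a standalone sketch. It is not the rsync patch itself, which lives inside rsync's own code. The function names are invented for illustration, and the fraction threshold is left as a required parameter because no specific value is given above.

```python
import os
import shutil


def should_discard_partial(partial_size, previous_size, min_fraction):
    """Return True if the partial copy is too small to be worth keeping.

    min_fraction is a caller-chosen threshold (hypothetical parameter).
    """
    return partial_size < min_fraction * previous_size


def pad_partial(partial_path, original_path):
    """Append to the partial file the original's data past that size.

    Mirrors the idea of taking data from the original file beyond the
    size point of the partial copy. Returns the number of bytes appended
    (0 if the original is not longer than the partial file).
    """
    partial_size = os.path.getsize(partial_path)
    original_size = os.path.getsize(original_path)
    if original_size <= partial_size:
        return 0
    with open(original_path, "rb") as original, \
            open(partial_path, "ab") as partial:
        original.seek(partial_size)            # skip what we already have
        shutil.copyfileobj(original, partial)  # append the remainder
    return original_size - partial_size
```

As with the patch, there is no guarantee the padded tail matches the new file; it just gives the next attempt more data to match against, which tends to help with files that mostly grow.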