Re: Extremely poor rsync performance on very large files (near 100GB and larger)
On Mon, 8 Jan 2007, Wayne Davison wrote:

> On Mon, Jan 08, 2007 at 01:37:45AM -0600, Evan Harris wrote:
>> I've been playing with rsync and very large files approaching and surpassing 100GB, and have found that rsync has extremely poor performance on these very large files, and the performance appears to degrade the larger the file gets.
>
> Yes, this is caused by the current hashing algorithm that the sender uses to find matches for moved data. The current hash table has a fixed size of 65536 slots, and can get overloaded for really large files.
> ...

Would it make more sense just to make rsync pick a more sane blocksize for very large files? I say that without knowing how rsync selects the blocksize, but I'm assuming that if a 65k entry hash table is getting overloaded, it must be using something way too small. Should it scale the blocksize with a power-of-2 algorithm (based on filesize) rather than growing the hash table?

I know that may result in more network traffic, since a bigger block containing a difference will be considered changed and need to be sent instead of smaller blocks, but in some circumstances wasting a little more network bandwidth may be wholly warranted. Then maybe the hash table size doesn't matter, since there are fewer blocks to check.

I haven't tested to see if that would work. Will -B accept a value of something large like 16meg? At my data rates, that's about half a second of network bandwidth, and seems entirely reasonable.

Evan

-- 
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html
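The trade-off raised in the message above (fewer, larger blocks versus a fixed 65536-slot table) can be put into rough numbers. The following is only a back-of-the-envelope sketch in Python: the block sizes are illustrative values, not the ones rsync actually selects, and the "average chain length" assumes blocks spread evenly over the slots, which a real hash need not do.

```python
HASH_SLOTS = 65536  # fixed table size mentioned in the thread

GIB = 1024 ** 3
MIB = 1024 ** 2
KIB = 1024


def blocks_in_file(file_size: int, block_size: int) -> int:
    """Number of blocks in the file, counting a short final block."""
    return -(-file_size // block_size)  # ceiling division


def average_chain_length(file_size: int, block_size: int,
                         slots: int = HASH_SLOTS) -> float:
    """Average entries per slot, assuming a perfectly even spread."""
    return blocks_in_file(file_size, block_size) / slots


if __name__ == "__main__":
    file_size = 100 * GIB
    # Illustrative block sizes only (assumption, not rsync's own choices).
    for block_size in (2 * KIB, 128 * KIB, 16 * MIB):
        n = blocks_in_file(file_size, block_size)
        avg = average_chain_length(file_size, block_size)
        print(f"block={block_size:>9} bytes  blocks={n:>9}  avg per slot={avg:.3f}")
```

With a 16 MiB block a 100 GiB file has only 6400 blocks, far fewer than the number of slots, which is the effect the message is hoping for; the cost is that any change inside a block resends the full 16 MiB.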
Questions and comments regarding --remove-sent-files (Was: New delete option?)
I've looked back through my mailing list archives, and seen a few messages touching on the same things I wanted to mention, but I figured it might be better to recap, since most of them were sent more than a year ago.

I have recently started using the --remove-sent-files option, and have noticed a couple of warts. I'm using it to transfer (move, really) gigabyte and larger sized files over a fairly slow connection (--bwlimit=10), keeping partial files (--partial) to minimize transfer time in the event of connection problems.

Because the individual files may take a day or more each to transfer, rsync interruptions are not uncommon, and I've had several instances where the first run of a transfer aborted partway through a file other than the first. Although rsync had successfully sent one or more files before losing the connection or being aborted, it doesn't appear to delete the files until a successful end of the whole rsync. A later restart of the rsync sees that some of the files already exist on the destination and need no update, and those files get left on the sending side when they shouldn't.

So, I agree with the parent message that either --remove-sent-files should delete the files immediately after they are successfully sent, or a new option should be added (--move maybe?) that does it that way.

I saw a followup mailing list message from Wayne that suggested adding the -I option to cause the desired behavior, and that looks like it would be a good workaround. Maybe all that is needed is to make a new --move option be an alias for --remove-sent-files and --ignore-times. Would this be a fairly simple enhancement?

The other issue I wanted to touch on was also mentioned on the mailing list: how to guard against the possibility that files on the sending side might have been modified during the transfer (which for me sometimes takes a day or more), and for rsync to realize this and avoid deleting the file and losing those changes. I know this one is a more difficult problem, but I just wanted to see if there might be an easy solution.

Wayne, thanks for all your work!

Evan

On Fri, 22 Apr 2005, Wayne Davison wrote:

> [It appears I missed this message back in February -- ouch.]
>
> On Sat, Feb 19, 2005 at 08:53:32PM -0500, Andrew Gideon wrote:
>> FWIW: In the manner I can envision using this, it makes more sense to delete the source as long as the destination file is valid, whether that file moved during this execution or not. This provides a mv function that's safe against a failure.
>
> That is an interesting point. The current option allows you to have identical files that didn't get transferred, and thus don't get removed, but that does mean that if the transfer gets interrupted it might do the wrong thing with a file.
>
> I'll contemplate what to do going forward since the --remove-sent-files option was already released. Perhaps a --remove-source-files option should be added that works as you suggested.
>
> ..wayne..
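The second concern in the message above (a source file changing while its transfer is in flight) can be illustrated with a small Python sketch. This is not how rsync handles it; it is only one possible guard, assuming that an unchanged size and modification time is good enough evidence that the file was not touched. The helper names `snapshot` and `remove_if_unchanged` are hypothetical.

```python
import os


def snapshot(path):
    """Record the size and modification time of a file before sending it."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def remove_if_unchanged(path, before):
    """Delete `path` only if it still matches the pre-transfer snapshot.

    Returns True if the file was removed, False if it changed and was kept.
    A small race remains between the check and the removal.
    """
    if snapshot(path) != before:
        return False
    os.remove(path)
    return True
```

A caller would take the snapshot just before the transfer starts and call `remove_if_unchanged` once the destination copy has been verified; a file edited in the meantime is left in place for the next run.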
Question and feature requests for processor bound systems
Is there any way to disable the checksum block search in rsync, or to somehow optimize it for systems that are processor-bound in addition to being network bound?

I'm using rsync on very low power embedded systems to rsync files that are sometimes comparatively large (a few hundred megs in size or larger), and am finding that just checksumming one such file on the sender takes tens of minutes. The systems in question have processors on the order of a Pentium 166, and in a test I did the other day, syncing a single ~500meg file took between 15 and 20 minutes just for the checksum calculation. Since these systems may be battery powered, keeping them up for long periods at full processor utilization is very expensive in power terms.

I couldn't find any such option, and I was trying to come up with a way to reduce that cpu-bound problem without completely abandoning rsync. So here are some proposed solutions that I am putting in as feature requests to help avoid this issue.

Option 1: Add an option, maybe --optimize-append, that would optimize the checksum search by telling it that it can assume that files are probably just appended to, like logfiles. This would make rsync not do checksums on the files at all except for very rudimentary checking. I would think a good algorithm might be to checksum only the first and last block of an existing file, and if those two blocks are the same, assume all intervening data is also the same and just transfer the remaining data. This is basically a hint that the file is only being appended to. Then if either of those blocks doesn't match, fall back to the full checksum algorithm.

Option 2: Add an option, maybe --checksum-block-skip=N, that would tell rsync, when checksumming the file, to only checksum every Nth block. This would keep most of the advantages of rsync, but would allow cpu-bound systems to speed up the checksumming process at the expense of possibly not detecting file differences if the differences fall in between blocks that are checksummed. This would basically be a hint that the only changes the file should contain would be insertions or deletions of data within the file, but no updates of blocks in-place. This would also help on systems that are disk-bound in addition to being network and cpu-bound, in that it doesn't have to read every block of the file to send checksums.

Option 3: Add an option, maybe --checksum-block-bytes=N, that would tell rsync to only checksum the first N bytes of every block. This would probably be used with a very large --block-size. This would be a hint that the file should have no insertions or deletions of data, but only in-place updates with large blocks, or possibly appended additions. This also would help disk-bound systems.

Option 4: Add an option, maybe --optimize-cpu or --weak-checksums, that would tell rsync to only use weak checksums up until the point in the file where the weak checksums first differ, and then fall back to the normal weak and strong checksums from there on. This is a hint that most likely the file is appended to, but will still catch most occurrences where a file was modified.

All of these options might also benefit from another option that says to only apply these optimizations to files over a certain size, or where the automatic blocksize is over a certain size.

Obviously, these optimizations would all be for systems with comparatively low cpu power, but as average filesizes continue to get larger and larger, they would also benefit even much faster systems when used on very large (several gigabytes and up) files.

In the process of testing this, I also found out that the timeout setting of ten minutes I had on the receiver side wasn't sufficient. So I was also wondering if it would be possible to add an option to make rsync, when used in daemon mode and not over another shell transport, use some form of tcp keepalives during long-running processes. This could allow me to reduce the timeout to a smaller value like 2 minutes, but still not let the rsync connection die as long as the remote system still had a live connection, even when one end was waiting on the other for very long operations (like this long-running checksum) and there was no other connection traffic.

Thoughts?

Evan
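For readers unfamiliar with the keepalive request at the end of the message above, here is a minimal Python sketch of what enabling TCP keepalives on a socket looks like at the operating-system level. It says nothing about rsync's own code. The function name `enable_keepalive` and the timing values are assumptions for illustration, and the three tuning options are only applied where the platform's `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=6):
    """Turn on TCP keepalive probes for an existing socket.

    idle     -- seconds of silence before the first probe (where supported)
    interval -- seconds between probes (where supported)
    count    -- failed probes before the connection is dropped (where supported)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning knobs are platform-specific, so apply only those that exist.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

Keepalive probes are handled by the kernel and are invisible to the application, so they detect a dead peer but do not by themselves reset an application-level I/O timeout; that part of the request would still need support inside the program.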
Re: Question and feature requests for processor bound systems
On Thu, 18 Aug 2005, Jan-Benedict Glaw wrote:

> By design, rsync trades CPU power for bandwidth.

True. But just because that is its main focus doesn't mean we can't also provide a facility for hinting the types of files being transferred, to lessen the impact of that tradeoff for systems that are both bandwidth AND cpu bound.

>> Option 4: Add an option, maybe --optimize-cpu or --weak-checksums, that would tell rsync to only use weak checksums up until the point in the file where the weak checksums first differ, and then fall back to the normal weak and strong checksums from there on. This is a hint that most likely the file is appended to, but will still catch most occurrences where a file was modified.
>
> Option 4: tar over netcat.

How would that not transfer portions of an existing file over again?

Evan
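To make the "weak checksum" in the quoted Option 4 concrete, here is a small Python sketch of a rolling checksum in the style described in the rsync technical report: two 16-bit running sums that can be slid forward one byte at a time without rereading the block. It is a simplified illustration, not rsync's actual implementation, and the function names are made up for this example.

```python
M = 1 << 16  # both sums are kept modulo 2^16


def weak_checksum(block: bytes):
    """Compute the (a, b) pair for a block from scratch."""
    n = len(block)
    a = sum(block) % M
    # Byte i (0-based) is weighted by its distance from the end of the block.
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, old: int, new: int, n: int):
    """Slide an n-byte window one byte: drop `old`, append `new`."""
    a = (a - old + new) % M
    b = (b - n * old + a) % M
    return a, b


def combine(a: int, b: int) -> int:
    """Pack the two 16-bit sums into one 32-bit value."""
    return a + (b << 16)
```

The point relevant to the thread is cost: rolling the window is a handful of integer operations per byte, while a strong checksum has to process the whole block each time it is needed.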
Re: Question and feature requests for processor bound systems
On Thu, 18 Aug 2005, Wayne Davison wrote:

> The --whole-file option (-W) disables the rsync algorithm entirely, but not the full-file checksum to verify that the file was transferred correctly.

Unfortunately, for these huge files, I don't want to retransfer the part that has already been retrieved.

> The CVS source (also available in the nightly tar files) has the --append option that only transfers files that have gotten longer (or are new), starting the transfer after all the existing data. That should save some checksum processing, but rsync still includes all the old data in the full-file checksum that verifies that the file was sent correctly.

Great! Thanks. Will that be going into the upcoming 2.6.7 version?

One question: does it also do a rudimentary check to make sure that the last block that is still present matches on the sender and receiver, so it can detect files whose data has shifted and fall back to the standard sync algorithm?

Evan
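The "rudimentary check" asked about above could look something like the following Python sketch. It is only an illustration of the idea, not a description of what --append does: it compares the last block of the shorter (destination) file with the same byte range of the source, and treats a match as evidence that the file was merely appended to. SHA-256 from `hashlib` is used purely for convenience, both files are read locally for simplicity (over a network only the digest would need to travel), and `looks_appended` with its 64 KiB default block size is a hypothetical helper.

```python
import hashlib
import os


def _digest(path, start, length):
    """Hash `length` bytes of `path` beginning at offset `start`."""
    with open(path, "rb") as f:
        f.seek(start)
        return hashlib.sha256(f.read(length)).digest()


def looks_appended(src_path, dst_path, block_size=64 * 1024):
    """Return True if dst appears to be a prefix of src.

    Only the final block of dst is compared, so changes earlier in the
    file go unnoticed; that is the trade-off being discussed.
    """
    src_size = os.path.getsize(src_path)
    dst_size = os.path.getsize(dst_path)
    if dst_size > src_size:
        return False  # destination is longer: not a simple append
    if dst_size == 0:
        return True   # nothing to compare, everything must be sent
    start = max(0, dst_size - block_size)
    length = dst_size - start
    return _digest(src_path, start, length) == _digest(dst_path, start, length)
```

If the check fails, the caller would fall back to the normal block-matching transfer rather than appending.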
Re: Feature request for rsync for running scripts
On Tue, 19 Jul 2005, Wayne Davison wrote:

> On Tue, Jul 19, 2005 at 06:27:56PM -0500, Evan Harris wrote:
>> Is it possible that this patch might be added to the mainstream release anytime soon?
>
> I was originally against the idea, but have softened my opposition after I saw how self-contained and simple the code turned out to be. One remaining problem with it as it stands now is that it doesn't have the proper configure support (we may need to add a putenv() compatibility function, for instance). If someone would like to help with the cross-platform compatibility of putenv(), that would help to speed this idea's acceptance a bit more.

Instead of putting those bits of info in the environment, why not just use the same substitutions as are available in the log format directive in the command string? That would avoid any configure issues, and be just as flexible. Or is that code not easy to reuse in this instance?

>> Is there a way to force creation of any necessary path components of a stem directory of an rsync?
>
> Not without using --relative (and all that implies). Rsync will only create the destination directory itself without that.

Thanks for the info. Maybe an option could be added to allow creating the required higher level dirs, or maybe the -R option could be modified to take a numeric parameter that specifies how many leading levels of the relative path should be removed, similar to what patch does?

Thanks.
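The numeric-parameter idea at the end of the message above can be sketched in a few lines of Python. This is a hypothetical illustration of the proposed behaviour, loosely modelled on how patch strips leading path components; it is not an existing rsync feature, the function name `strip_leading` is invented here, and unlike patch it discards the root slash without counting it as a component.

```python
from pathlib import PurePosixPath


def strip_leading(path: str, levels: int) -> str:
    """Drop `levels` leading directory components from a path.

    The root slash of an absolute path is discarded and not counted.
    """
    parts = PurePosixPath(path).parts
    if parts and parts[0] == "/":
        parts = parts[1:]
    if levels < 0 or levels >= len(parts):
        raise ValueError("cannot strip that many components")
    return str(PurePosixPath(*parts[levels:]))


if __name__ == "__main__":
    # Illustrative path only.
    print(strip_leading("/var/base/testing/file.txt", 2))
```

With two levels removed, a source path under /var/base/testing/ would be recreated on the destination starting from testing/.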
Feature request for rsync for running scripts
I was wondering if others might find it useful to have a parameter in the rsync daemon config that would allow running a command on the server at session start or at successful rsync completion.

For instance, this would allow a webpage to be automatically maintained (by a script called by this method) with the timestamps of the last successful rsync completion (no errors, all files transferred), which would be very nice for keeping track of the state of mirrors.

I looked in the patches directory, and see there is already a pre-post-exec.diff patch to allow for running scripts before and after the chroot, but I'm more interested in running after successful rsync completion than after the chroot. The existing patch also doesn't provide a way to pass other useful information like the username of an authenticated user or the IP address of the rsync client, either of which might be useful in tracking which mirrors are up to date. Other info like the module name or even the number/size of files transferred would also be useful.

This would also be very nice for doing things like queuing a backup job for the module that was rsynced, in the case of dirs pushed to the server for the purposes of backups. Maybe there could also be a way to bypass the script if no files were transferred but the rsync was otherwise error-free.

Comments appreciated.

Evan
Re: Feature request for rsync for running scripts
On Tue, 19 Jul 2005, Andrew Burgess wrote:

> On Tue, 19 Jul 2005 12:10:18 -0700, Evan Harris [EMAIL PROTECTED] wrote:
>> I was wondering if others might find it useful to have a parameter in the rsync daemon config that would allow running a command on the server at session start or at successful rsync completion.
>
> Couldn't you just do this on the client side?
>
>     rsync...
>     ssh commands to run after rsync...

That requires:

1. Running ssh and enabling it for client traffic.
2. Having a user account for each client and giving it shell access.
3. Giving each client account access to a suid script and/or permission to do the updates itself.

Basically, it undoes practically all of the advantages of running an rsync daemon in the first place instead of just using rsync over ssh, as well as potentially creating some big security issues. But the biggest one is that I don't want to give mirror operators/clients shell access.

Evan
Re: Feature request for rsync for running scripts
> I've just upgraded it to add more environment variables (such as module name, module path, host name, host IP, user name, exit status). I also changed where the post-rsync exec happens so that the pre- and post-xfer commands are both now run by the user that runs the daemon (not the module's user) and without any chroot constraints (the old patch was simpler, and it made the post-xfer command run in a more restricted setting than the pre-xfer command). You can see the latest patch here:
>
> http://rsync.samba.org/ftp/unpacked/rsync/patches/pre-post-exec.diff

Thank you very much! That sounds like it's just what I'm looking for. Is it possible that this patch might be added to the mainstream release anytime soon?

> The current patch doesn't include any info on transfer direction nor any transfer stats (which might not be easy to do).

I just threw those in there as suggestions. I doubt I'll need them.

I did just notice another missing thing (or I've overlooked the option for doing it). Is there a way to force creation of any necessary path components of a stem directory of an rsync?

    rsync -av /var/base/testing/ testserver::pushdir/beta5/testing/

This command fails with a "no such file or directory" error if the beta5 directory doesn't exist. I would have expected that with the -a option, it would replicate whatever dirs are necessary to do the sync. Or am I missing something?

Thanks!
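A post-transfer script of the kind discussed in this thread (one that records the last successful sync per mirror) might look like the Python sketch below. The environment variable names are assumptions: they follow the names documented for later rsync releases (RSYNC_MODULE_NAME, RSYNC_HOST_ADDR, RSYNC_USER_NAME, RSYNC_EXIT_STATUS) and may not match the patch as it stood at the time of the message. The function name and output format are made up for illustration.

```python
import os
import time


def status_line(env, now=None):
    """Build a one-line status record from the daemon-provided environment.

    `env` is any mapping (normally os.environ); `now` is a Unix timestamp,
    defaulting to the current time.
    """
    module = env.get("RSYNC_MODULE_NAME", "?")   # assumed variable name
    host = env.get("RSYNC_HOST_ADDR", "?")       # assumed variable name
    user = env.get("RSYNC_USER_NAME", "") or "-" # assumed variable name
    ok = env.get("RSYNC_EXIT_STATUS", "") == "0" # assumed variable name
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    result = "ok" if ok else "failed"
    return f"{stamp} module={module} host={host} user={user} result={result}"


if __name__ == "__main__":
    # When run by the daemon, the real environment supplies the values.
    print(status_line(os.environ))
```

A real hook would append this line to a file that a status page reads, and could skip the update when the result is not "ok".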
Verbosity of log messages in daemon mode
When running rsync in daemon mode, is there a way to suppress server-excluded messages in the logfile? I've tried setting both of

    max verbosity = 0
    transfer logging = no

but they are still showing up. Rsync 2.6.5.

Unfortunately, I'm using rsync to get a whole tree every half hour, but there are a few excluded directories that have several hundred files in them, and it's making the logs huge.

Thanks.

Evan
Re: Verbosity of log messages in daemon mode
Aha, I think I answered my own question. Unfortunately, it appears that the max verbosity setting is a global parameter that can't be in a module area. That might be something to change in the future, or at least make clear in the docs.

Also, speaking of messages, when running with -vn I can't seem to figure out a way to suppress the non-file messages. It always wants to show:

    building file list ... done
    sent 6999 bytes received 16 bytes 14030.00 bytes/sec
    total size is 423894309 speedup is 60426.84

I want ONLY the files to be transferred to be given back on stdout, so I can get a good approximation of the total bytes to be transferred to display in a higher level UI. I can include code to try to ignore the additional messages, but that seems kludgy and will break if the messages ever change.

Evan

On Tue, 19 Jul 2005, Evan Harris wrote:

> When running rsync in daemon mode, is there a way to suppress server-excluded messages in the logfile? I've tried setting both of
>
>     max verbosity = 0
>     transfer logging = no
>
> but they are still showing up. Rsync 2.6.5.
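The workaround the message calls kludgy (filtering the extra lines out of the -vn output) would look roughly like this Python sketch. It recognises only the three message shapes quoted in the message, so, as the author warns, it would silently break if rsync's wording changed. The function name `file_lines` is hypothetical.

```python
import re

# Patterns for the non-file lines quoted in the message; nothing else is known.
NOISE = [re.compile(p) for p in (
    r"^building file list",
    r"^sent \d+ bytes\s+received \d+ bytes",
    r"^total size is \d+",
)]


def file_lines(output: str):
    """Return only the lines of rsync -vn output that look like file names."""
    kept = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        if any(p.search(line) for p in NOISE):
            continue  # skip header and summary lines
        kept.append(line)
    return kept
```

The caller could then stat each returned path to estimate the total bytes to be transferred.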
Problem with rsync --inplace very slow/hung on large files
I'm trying to rsync a very large (62gig) file from one machine to another as part of a nightly backup. If the file does not exist at the destination, it takes about 2.5 hours to copy in my environment.

But if the file does exist, --inplace is specified, and the file contents differ, rsync is either slowed so much that it takes more than 30 hours (the longest I've let an instance run), or it is just hung.

Running with -vvv gives this as the last few lines of the output:

    match at 205401064 last_match=205401064 j=821 len=250184 n=0
    match at 205651248 last_match=205651248 j=822 len=250184 n=0
    match at 205901432 last_match=205901432 j=823 len=250184 n=0
    match at 206151616 last_match=206151616 j=824 len=250184 n=0

It has not printed anything else since I last looked at the current run attempt about 8 hours ago.

Doing an strace on the rsync processes on the sending and receiving machines, it appears that there is still reading and writing going on, but there isn't any output from the -vvv and I can't tell if it's really doing anything.

Is this excessive slowness just an artifact of doing an rsync --inplace on such a large file, and will it eventually complete if left to run long enough? I would try testing without --inplace, but the system in question doesn't have enough disk space for two copies of a file that size, which is why I am using --inplace.

Using 2.6.3, on Debian.

Any help appreciated.

Evan
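The -vvv lines quoted above can be turned into a rough progress figure. The Python sketch below parses one such line and reports how far into the file the last match was. The field meanings are inferred from the quoted output (in each line the offset equals j times len, which suggests j is a block index and len the block length), and the 62 GiB file size is an assumption based on the "62gig" in the message.

```python
import re

MATCH_RE = re.compile(
    r"match at (\d+) last_match=(\d+) j=(\d+) len=(\d+) n=(\d+)"
)


def parse_match(line: str):
    """Extract the numeric fields from one 'match at' debug line, or None."""
    m = MATCH_RE.search(line)
    if m is None:
        return None
    offset, last_match, j, length, n = (int(g) for g in m.groups())
    return {"offset": offset, "last_match": last_match,
            "j": j, "len": length, "n": n}


def fraction_done(offset: int, file_size: int) -> float:
    """Share of the file covered up to `offset`."""
    return offset / file_size


if __name__ == "__main__":
    last = parse_match(
        "match at 206151616 last_match=206151616 j=824 len=250184 n=0"
    )
    assumed_size = 62 * 1024 ** 3  # assumption: '62gig' read as 62 GiB
    print(f"{fraction_done(last['offset'], assumed_size):.2%} of the file")
```

Under that assumption the last logged match is well under one percent of the way into the file, which fits the description of a run that stalled early rather than one that was nearly done.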