Re: Extremely poor rsync performance on very large files (near 100GB and larger)

2007-01-08 Thread Evan Harris


On Mon, 8 Jan 2007, Wayne Davison wrote:


On Mon, Jan 08, 2007 at 01:37:45AM -0600, Evan Harris wrote:


I've been playing with rsync and very large files approaching and
surpassing 100GB, and have found that rsync has extremely poor
performance on these very large files; the performance appears to
degrade further the larger the file gets.


Yes, this is caused by the current hashing algorithm that the sender
uses to find matches for moved data.  The current hash table has a fixed
size of 65536 slots, and can get overloaded for really large files.
...


Would it make more sense just to have rsync pick a saner blocksize for 
very large files?  I say that without knowing how rsync selects the 
blocksize, but I'm assuming that if a 65k-entry hash table is getting 
overloaded, it must be using something far too small.  Should it be scaling 
the blocksize with a power-of-2 algorithm (based on filesize), rather than 
scaling the hash table?


I know that may result in more network traffic as a bigger block containing 
a difference will be considered changed and need to be sent instead of 
smaller blocks, but in some circumstances wasting a little more network 
bandwidth may be wholly warranted.  Then maybe the hash table size doesn't 
matter, since there are fewer blocks to check.


I haven't tested to see if that would work.  Will -B accept a value of 
something large like 16meg?  At my data rates, that's about a half a second 
of network bandwidth, and seems entirely reasonable.
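To put rough numbers on that, here's a quick sketch (Python; the square-root
block-size scaling is my assumption about a plausible heuristic, not a claim
about rsync's actual code):

```python
import math

HASH_SLOTS = 65536  # the fixed hash-table size mentioned above

def block_count(file_size, block_size):
    # Number of checksum blocks the sender has to hash and search.
    return math.ceil(file_size / block_size)

def sqrt_block_size(file_size, minimum=700, multiple=8):
    # Assumed sqrt-scaled blocksize, rounded down to a multiple of 8.
    size = max(minimum, int(math.sqrt(file_size)))
    return size - (size % multiple)

GB = 1 << 30
for size_gb in (1, 100):
    bs = sqrt_block_size(size_gb * GB)
    print(size_gb, "GB:", block_count(size_gb * GB, bs), "blocks of", bs, "bytes")

# A 16 MiB -B value keeps even a 100 GB file to a few thousand blocks:
print("16 MiB blocks:", block_count(100 * GB, 16 << 20))
```

Under those assumptions a 100GB file still produces about five times more
blocks than hash slots even with sqrt scaling, while -B at 16MiB drops it to
6400 blocks.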


Evan
--
To unsubscribe or change options: https://lists.samba.org/mailman/listinfo/rsync
Before posting, read: http://www.catb.org/~esr/faqs/smart-questions.html


Questions and comments regarding --remove-sent-files (Was: New delete option?)

2006-09-12 Thread Evan Harris


I've looked back through my mailing list archives, and seen a few messages 
touching on the same things I wanted to mention, but I figured it might be 
better to recap, since most of them were sent more than a year ago.


I have recently started using the --remove-sent-files option, and have 
noticed a couple of warts.  I'm using it to transfer (move really) gigabyte 
and larger sized files over a fairly slow connection (--bwlimit=10) and with 
keeping of partial files (--partial) to minimize transfer time in the event 
of connection problems.


Because the individual files may take a day or more each to transfer, rsync 
interruptions are not uncommon, and I've had several instances where the 
first run of a transfer aborted in the middle of the non-first file. 
Although rsync had successfully sent one or more files before losing the 
connection or being aborted, it doesn't appear to delete the files until a 
successful end of the whole rsync.  A later restart of the rsync sees that 
some of the files already exist on the destination and need no update, and 
those files get left on the sending side when they shouldn't.


So, I agree with the parent message that either --remove-sent-files should 
delete the files immediately after they are successfully sent, or a new 
option should be added (--move maybe?) that does it that way.


I saw a followup mailing list message from Wayne that suggested adding the 
-I option to cause the desired behavior, and that looks like it would be a 
good workaround.  Maybe all that is needed is to make a new --move option be 
an alias for --remove-sent-files and --ignore-times.  Would this be a fairly 
simple enhancement?


The other issue I wanted to touch on was also mentioned on the mailing list, 
and was how to guard against the possibility that files on the sending side 
might have been modified during the transfer (which for me sometimes takes a 
day or more), and for rsync to realize this and avoid deleting the file and 
losing those changes.  I know this one is a more difficult problem, but I 
just wanted to see if there might be an easy solution.


Wayne, thanks for all your work!

Evan


On Fri, 22 Apr 2005, Wayne Davison wrote:


[It appears I missed this message back in February -- ouch.]

On Sat, Feb 19, 2005 at 08:53:32PM -0500, Andrew Gideon wrote:

FWIW: In the manner I can envision using this, it makes more sense to
delete the source as long as the destination file is valid, whether
that file moved during this execution or not.  This provides a mv
function that's safe against a failure.


That is an interesting point.  The current option allows you to have
identical files that didn't get transferred, and thus don't get removed,
but that does mean that if the transfer gets interrupted it might do the
wrong thing with a file.

I'll contemplate what to do going forward since the --remove-sent-files
option was already released.  Perhaps a --remove-source-files option
should be added that works as you suggested.

..wayne..


Question and feature requests for processor bound systems

2005-08-18 Thread Evan Harris

Is there any way to disable the checksum block search in rsync, or to
somehow optimize it for systems that are processor-bound in addition to
being network bound?

I'm using rsync on very low power embedded systems to rsync files that are
sometimes comparatively large (sometimes a few hundred megs in size or
larger), and am finding that just checksumming one such file on the
sender is taking tens of minutes.

The systems in question have processors on the order of a pentium 166, and
the tests I did the other day syncing a single ~500meg file took between
15 and 20 minutes just for the checksum calculation.  When these systems are
potentially battery powered, the cost of keeping the system up for long
periods at full processor utilization is very expensive in power terms.

I couldn't find any such option, and I was trying to come up with a way to
reduce that cpu-bound problem without completely abandoning rsync.  So here
are some proposed solutions that I put in as feature requests to help avoid
this issue.

Option 1: Add an option, maybe --optimize-append, that would optimize the
checksum search by telling it that it can assume that files are probably
just appended to, like logfiles.  This would make rsync not do checksums on
the files at all except for very rudimentary checking.  I would think a good
algorithm might be to checksum only the first and last block of an existing
file, and if those two blocks are the same, assume all intervening data is
also the same and just transfer the remaining data.  This is basically a
hint that the file is only being appended to.  Then if either of those
blocks don't match, fall back to the full checksum algorithm.
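For what it's worth, that heuristic could be prototyped in a few lines.  This
is only a sketch (the 64KB block and the use of MD5 here are my choices for
illustration, not anything rsync actually does):

```python
import hashlib
import os

def looks_appended(basis_path, source_path, block=65536):
    # Option 1 sketch: compare the first block and the last block of the
    # receiver's (shorter) copy against the same offsets in the sender's
    # file; if both match, assume everything in between is unchanged and
    # only the appended tail needs to be sent.
    old_size = os.path.getsize(basis_path)
    if os.path.getsize(source_path) < old_size:
        return False  # source shrank, so this is not a pure append

    def read_at(path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    tail_off = max(0, old_size - block)
    tail_len = old_size - tail_off
    return (hashlib.md5(read_at(basis_path, 0, block)).digest()
              == hashlib.md5(read_at(source_path, 0, block)).digest()
            and hashlib.md5(read_at(basis_path, tail_off, tail_len)).digest()
              == hashlib.md5(read_at(source_path, tail_off, tail_len)).digest())
```

If either block differs, the caller would fall back to the full checksum
search, exactly as described above.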

Option 2: Add an option, maybe --checksum-block-skip=N, that would tell
rsync that when checksumming the file, to only checksum every Nth block.
This would still allow keeping most of the advantages of the rsync algorithm,
but would allow cpu-bound systems to speed up the checksumming process at
the expense of possibly not detecting file differences if the differences
fall in between blocks that are checksummed.  This would basically be a hint
that the only changes the file should contain would be insertions or
deletions of data within the file, but no updates of blocks in-place.  This
would also help on systems that are disk-bound in addition to being network
and cpu-bound, since rsync wouldn't have to read every block of the file to
send checksums.

Option 3: Add an option, maybe --checksum-block-bytes=N, that would tell
rsync to only checksum the first N bytes of every block.  This would
probably be used with a very large --block-size.  This would be a hint that
the file should have no insertions or deletions of data, but only in-place
updates with large blocks, or possibly appended additions.  This also would
help disk-bound systems.

Option 4: Add an option, maybe --optimize-cpu, or --weak-checksums that
would tell rsync to only use weak checksums up until the point in the file
where the weak checksums first differ, and then fallback to the normal weak
and strong checksums from there on.  This is a hint that most likely the
file is appended to, but will still catch most occurrences where a file was
modified.
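To illustrate Option 4: a simplified weak checksum in the spirit of rsync's
rolling checksum (not the real algorithm), walking aligned blocks until the
first disagreement:

```python
def weak_checksum(block):
    # Simplified weak checksum built from two 16-bit running sums, in
    # the spirit of rsync's Adler-like rolling checksum; this is a
    # stand-in for illustration, not rsync's exact algorithm.
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def first_weak_mismatch(old_data, new_data, block=700):
    # Option 4 sketch: compare aligned blocks with the cheap weak
    # checksum only, and return the offset where they first disagree
    # (the point at which the full weak+strong search would take over),
    # or None if the shared prefix matches throughout.
    n = min(len(old_data), len(new_data))
    offset = 0
    while offset < n:
        if (weak_checksum(old_data[offset:offset + block])
                != weak_checksum(new_data[offset:offset + block])):
            return offset
        offset += block
    return None
```

Everything before the returned offset would be handled with cheap arithmetic
only; strong checksums would start from there.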

All of these options might also benefit from another option that says to
only apply these optimizations to files over a certain size, or where the
automatic blocksize is over a certain size.

Obviously, these optimizations would all be for systems with comparatively
low cpu power, but as average filesizes continue to get larger and larger,
they would also benefit even much faster systems when used on very large
(several gigabytes and up) files.

In the process of testing this, I also found out that the ten-minute timeout
setting I had on the receiver side wasn't sufficient.  So I was also
wondering if it would be possible to add an option to make rsync, when used
in daemon mode and not over another shell transport, use some form of tcp
keepalives during long-running processes.  This could allow me to reduce the
timeout to a smaller value like 2 minutes, but still not let the rsync
connection die as long as the remote system still had a live connection
even when one end was waiting on the other for very long operations (like
this long-running checksum) and there was no other connection traffic.
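The mechanism itself is just the SO_KEEPALIVE socket option plus the
Linux-specific tuning knobs; here is a sketch of the calls involved (Python
only to show the socket options; rsync itself would do this in C):

```python
import socket

def enable_keepalive(sock, idle=120, interval=30, probes=4):
    # Turn on TCP keepalives so an idle-but-healthy connection outlives
    # a short --timeout.  The fine-grained knobs (TCP_KEEPIDLE etc.) are
    # Linux-specific, hence the hasattr guards.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
```

With settings like these, the kernel probes an idle connection every couple
of minutes, so a short --timeout would only fire when the peer is truly gone.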

Thoughts?

Evan



Re: Question and feature requests for processor bound systems

2005-08-18 Thread Evan Harris

On Thu, 18 Aug 2005, Jan-Benedict Glaw wrote:

 By design, rsync trades CPU power for bandwidth.

True.  But just because that is its main focus doesn't mean we can't also
provide a facility for hinting the types of files being transferred to
lessen the impact of that tradeoff for systems that are both bandwidth AND
cpu bound.

  Option 4: Add an option, maybe --optimize-cpu, or --weak-checksums that
  would tell rsync to only use weak checksums up until the point in the file
  where the weak checksums first differ, and then fallback to the normal weak
  and strong checksums from there on.  This is a hint that most likely the
  file is appended to, but will still catch most occurrences where a file was
  modified.

 Option 4: tar over netcat.

How would that not transfer portions of an existing file over again?

Evan



Re: Question and feature requests for processor bound systems

2005-08-18 Thread Evan Harris

On Thu, 18 Aug 2005, Wayne Davison wrote:

 The --whole-file option (-W) disables the rsync algorithm entirely, but
 not the full-file checksum to verify that the file was transferred
 correctly.

Unfortunately, for these huge files, I don't want to retransfer the part
that has already been retrieved.

 The CVS source (also available in the nightly tar files) has the
 --append option that only transfers files that have gotten longer (or
 are new), starting the transfer after all the existing data.  That
 should save some checksum processing, but, rsync still includes all
 the old data in the full-file checksum that verifies that the file
 was sent correctly.

Great!  Thanks.  Will that be going into the upcoming 2.6.7 version?

One question: does it also do a rudimentary check that the last block still
present on the receiver matches the sender's copy, so it can detect files
whose data has been shifted and fall back to the standard sync algorithm?

Evan



Re: Feature request for rsync for running scripts

2005-07-20 Thread Evan Harris

On Tue, 19 Jul 2005, Wayne Davison wrote:

 On Tue, Jul 19, 2005 at 06:27:56PM -0500, Evan Harris wrote:
  Is it possible that this patch might be added to the mainstream
  release anytime soon?

 I was originally against the idea, but have softened my opposition after
 I saw how self-contained and simple the code turned out to be.  One
 remaining problem with it as it stands now is that it doesn't have the
 proper configure support (we may need to add a putenv() compatibility
 function, for instance).  If someone would like to help with the cross-
 platform compatibility of putenv(), that would help to speed this idea's
 acceptance a bit more.

Instead of putting those bits of info in the environment, why not just use
the same substitutions as are available in the log format directive in the
command string?  That would avoid any configure issues, and be just as
flexible. Or is that code not easy to reuse in this instance?

  Is there a way to force creation of any necessary path
  components of a stem directory of an rsync?

 Not without using --relative (and all that implies).  Rsync will only
 create the destination directory itself without that.

Thanks for the info.  Maybe either an option could be added to allow
creating the required higher level dirs, or maybe the -R option could be
modified to be able to take a numeric parameter that specifies how many
levels of the relative path should be removed, similar to what patch does?
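To make the suggestion concrete, the numeric -R parameter could trim leading
components the way patch -pN does.  This is purely hypothetical semantics:

```python
def strip_components(path, n):
    # Hypothetical "-R with a number" behavior: drop the first n
    # components of the relative path, the way patch -pN trims the
    # file names in a diff before applying it.
    parts = [p for p in path.split("/") if p not in ("", ".")]
    return "/".join(parts[n:])
```

So with a (hypothetical) strip count of 2, /var/base/testing/file would be
created as testing/file under the destination, and rsync would create the
testing/ directory itself.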

Thanks.



Feature request for rsync for running scripts

2005-07-19 Thread Evan Harris

I was wondering if others might find it useful to have a parameter in the
rsync daemon config that would allow running a command on the server at
session start or at successful rsync completion.

For instance, this would allow a webpage to be automatically maintained (by
a script called by this method) with the timestamps of the last successful
rsync completion (no errors, all files transferred), which would be very
nice for keeping track of the state of mirrors.

I looked in the patches directory, and see there is already a
pre-post-exec.diff patch to allow for running scripts before and after the
chroot, but I'm more interested in running after successful rsync completion
than after the chroot.

The already present patch also doesn't allow for some method of passing
other useful information like the username of an authenticated user, or the
ipaddress of the rsync client, either of which might be useful in tracking
which mirrors were up to date.  Other info like the module name or even the
number/size of files transferred would also be useful.

This would also be very nice for doing things like queuing a backup job for
the module that was rsynced in the case of dirs pushed to the server for
the purposes of backups.  Maybe also a way to bypass the script if no files
were transferred, but the rsync was otherwise error-free.

Comments appreciated.

Evan



Re: Feature request for rsync for running scripts

2005-07-19 Thread Evan Harris

On Tue, 19 Jul 2005, Andrew Burgess wrote:

 On Tue, 19 Jul 2005 12:10:18 -0700, Evan Harris [EMAIL PROTECTED]
 wrote:

  I was wondering if others might find it useful to have a parameter in the
  rsync daemon config that would allow running a command on the server at
  session start or at successful rsync completion.

 Couldn't you just do this on the client side?

 rsync...
 ssh commands to run after rsync...

That requires:

1. Running ssh and enabling it for client traffic.
2. Having a user account for each client and giving it shell access.
3. Giving each client account access to a suid script and/or permission to
do the updates itself.

Basically, it undoes practically all of the advantages of running an rsync
daemon in the first place instead of just using rsync over ssh, as well as
potentially creating some big security issues.  But the biggest one is that
I don't want to give mirror operators/clients shell access.

Evan



Re: Feature request for rsync for running scripts

2005-07-19 Thread Evan Harris

 I've just upgraded it to add more environment variables (such as module
 name, module path, host name, host IP, user name, exit status).  I also
 changed where the post-rsync exec happens so that both the pre- and
 post-xfer are both now run by the user that runs the daemon (not the
 module's user) and without any chroot constraints (the old patch was
 simpler, and it made the post-xfer command run in a more restricted
 setting than the pre-xfer command).  You can see the latest patch here:

 http://rsync.samba.org/ftp/unpacked/rsync/patches/pre-post-exec.diff

Thank you very much!  That sounds like it's just what I'm looking for.  Is
it possible that this patch might be added to the mainstream release anytime
soon?
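For reference, a post-xfer command consuming those variables could be as
small as the sketch below.  The RSYNC_* names are taken from the description
above; the exact names should be checked against the patch:

```python
#!/usr/bin/env python
import os
import time

def log_transfer(logfile):
    # Sketch of a post-xfer command: append one line per run recording
    # which module was refreshed, by whom, and whether it succeeded.
    # The RSYNC_* variable names follow the description of the patch;
    # treat the exact names as assumptions until you read the patch.
    module = os.environ.get("RSYNC_MODULE_NAME", "?")
    host = os.environ.get("RSYNC_HOST_ADDR", "?")
    status = os.environ.get("RSYNC_EXIT_STATUS", "?")
    line = "%s %s from %s exit=%s\n" % (
        time.strftime("%Y-%m-%d %H:%M:%S"), module, host, status)
    with open(logfile, "a") as f:
        f.write(line)
    return line
```

A mirror-status webpage could then be generated from that log, which is
exactly the use case described earlier in the thread.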

 The current patch doesn't include any info on transfer direction nor any
 transfer stats (which might not be easy to do).

I just threw those in there as suggestions.  I doubt I'll need them.

I did just notice another missing thing (or I've overlooked the option for
doing it).  Is there a way to force creation of any necessary path
components of a stem directory of an rsync?

rsync -av /var/base/testing/ testserver::pushdir/beta5/testing/

This command fails with a "no such file or directory" error if the beta5
directory doesn't exist.  I would have expected that, with the -a option,
it would replicate whatever dirs are necessary to do the sync.  Or am I
missing something?

Thanks!



Verbosity of log messages in daemon mode

2005-07-19 Thread Evan Harris

When running rsync in daemon mode, is there a way to suppress
server-excluded messages in the logfile?  I've tried setting both of

max verbosity = 0
transfer logging = no

but they are still showing up.  Rsync 2.6.5.

I'm using rsync to get a whole tree every half hour, but there are a few
excluded directories that have several hundred files in them, and it's
making the logs huge.

Thanks.

Evan



Re: Verbosity of log messages in daemon mode

2005-07-19 Thread Evan Harris

Aha, I think I answered my own question.  Unfortunately, it appears that the
max verbosity setting is a global parameter that can't be set in a module
section.  That might be something to change in the future, or at least make
clear in the docs.

Also, speaking of messages, when running with -vn I can't seem to figure out a
way to suppress the non-file messages.  It always wants to show:

building file list ... done
sent 6999 bytes  received 16 bytes  14030.00 bytes/sec
total size is 423894309  speedup is 60426.84

I just want ONLY the files to be transferred to be given back on stdout, so
I can get a good approximation of the total bytes to be transferred to
display in a higher level UI.

I can include code to try to ignore the additional messages, but that seems
kludgy and will break if the messages ever change.

Evan


On Tue, 19 Jul 2005, Evan Harris wrote:

 When running rsync in daemon mode, is there a way to suppress
 server-excluded messages in the logfile?  I've tried setting both of

 max verbosity = 0
 transfer logging = no

 but they are still showing up.  Rsync 2.6.5.




Problem with rsync --inplace very slow/hung on large files

2005-03-15 Thread Evan Harris

I'm trying to rsync a very large (62gig) file from one machine to another as
part of a nightly backup.  If the file does not exist at the destination, it
takes about 2.5 hours to copy in my environment.

But, if the file does exist and --inplace is specified, and the file
contents differ, rsync either is so significantly slowed as to take more
than 30 hours (the longest I've let an instance run), or it is just hung.

Running with -vvv gives this as the last few lines of the output:

match at 205401064 last_match=205401064 j=821 len=250184 n=0
match at 205651248 last_match=205651248 j=822 len=250184 n=0
match at 205901432 last_match=205901432 j=823 len=250184 n=0
match at 206151616 last_match=206151616 j=824 len=250184 n=0

at which point it has not printed anything else since I last looked at the
current run attempt about 8 hours ago.

Doing an strace on the rsync processes on the sending and receiving machines
it appears that there is still reading and writing going on, but there isn't
any output from the -vvv and I can't tell if it's really doing anything.

Is this excessive slowness just an artifact of doing an rsync --inplace on
such a large file, and will it eventually complete if left to run long enough?

I would try testing without the --inplace, but the system in question
doesn't have enough disk space for two copies of that size file, which is
why I am using --inplace.

Using 2.6.3, on Debian.  Any help appreciated.

Evan
