> Date: Sun, 25 Jan 2009 01:02:15 -0500
> From: Matt McCutchen <m...@mattmccutchen.net>

> I regret the slow response. I was interested in your problem, but I
> knew it would take me a while to respond thoughtfully, so I put the
> message aside and didn't get back to it until now. I hope this is still
> useful.

Yes, it is. Thanks. [The immediate need to move the filesystem is gone because the underlying hardware problem has been solved, but eventually I'm going to want to migrate this ext3 to ext4, and the problem will recur at that point. Besides, I'm not the only one who might need to move such extensively-hardlinked filesystems.]

> > Okay, so the above shows that --link-dest without -R appears to work, BUT---
> > how come there was no actual output from rsync when it created dst/a/2/foo?
> > Correct side-effect (foo created, with correct inode), but incorrect output.
> The lack of output here is by design. That's not to say that I think
> the design is a good one.

I have to confess that I don't, either. (...but see below.)

> [ . . . ]
> However, the more recently added --copy-dest and --link-dest:
> [ . . . ]
> have the IMHO more useful interpretation that the basis dir is to be
> used as an optimization (of network traffic and/or destination disk
> usage), without affecting either the itemization or the final contents
> of the destination. I entered an enhancement request for this to be
> supported properly:
> https://bugzilla.samba.org/show_bug.cgi?id=5645

I see where you're going with that; I assume that such an enhancement would, as fallout, cause itemization of created hardlinks when using a --dest arg. (Right now, they're itemized in a "normal" run with -H but without a --dest, but don't appear if --dest is added, which looks to someone who hasn't followed the entire history like a bug---and makes the output less useful, too.)

...though on the other hand, would this dramatically clutter up the output of a "normal" --link-dest where, typically, one is looking to see which -new- files got transferred as opposed to seeing the creation of a zillion files that were in the basis dirs? (Since you seem to advocate two different options, I guess that would allow users to decide either way.)

> [ . . . ]
> Right. To recap the problem: In order to transfer both b/2/ and c/2/ to
> the proper places under dst/ in a single run, you needed to include the
> "b/2/" and "c/2/" path information in the file list by using -R. But
> consequently, rsync is going to look for b/2/foo and c/2/foo under
> whatever --link-dest dir you specify, and there's no directory on the
> destination side that contains files at those paths (yet).

So you're saying that there appears to be no way to tell rsync what I want to do in this case---I haven't missed something, and it's either a limitation or a design goal that it works this way. Correct? [Err, except that perhaps you have a solution below; it's just that -R is pretty much useless with any of the --*-dests.]

> Tilde expansion is the shell's job.

Right, I realized what was going on just after I sent the mail. (I was concentrating on the real problem at hand, of course, and missed that I'd put an = in there, defeating the shell; attributing tilde expansion to anything but the shell must have meant I'd been awake too long. :)

> I think using a separate rsync run for each hostX/DATE dir is the way to
> go since it's easy to specify an appropriate --link-dest dir, or more
> than one. With this approach, you don't need -H unless you want to
> preserve hard links among a single host's files on a single day.

I do need -H for that reason (there are many crosslinked files in any individual source host---not just in the dirvish vault), but unfortunately doing a separate run for each hostX/DATE combination isn't enough either, which is how I got into this problem---the reason is that there are crosslinks -across- the hosts that I -also- want to preserve. Although perhaps your suggestion below is the solution.

(How did this happen? Because after each date's backups, I run faster-dupemerge across all hosts (and across the previous date's run), all at once, e.g. 6 hosts times 2 dates, in my example. This merges files that are the same across hosts [distribution-related stuff, mostly] and also catches files that moved across directories or across hosts---oh, whoops, I just realized I mentioned this the first time, but it bears repeating 'cause it's why this is an unusual case. Not having rsync catch this when I'm copying this giant hierarchy to a new filesystem would undo the work unless I ran f-d on the copy as it was being created, which would increase the time to move everything by quite a lot.)

> In recent months, several rsnapshot users have posted about migration
> problems similar to yours but one-dimensional (dates only), and I wrote
> a script called "rsnapshot-copy" to automate the process of copying the
> dates one by one, each time with --link-dest to the previous date:
> http://rsnapshot.cvs.sourceforge.net/viewvc/rsnapshot/rsnapshot/utils/rsnapshot-copy?view=markup

Yeah, that's basically the script I was about to write; thanks for pointing it out. (As a one-shot, mine would probably have been a very simple shell loop seeded with the output of ls across one of the dirvish vaults.)

> You may wish to read the thread from which rsnapshot-copy originated for
> more insights:
> http://sourceforge.net/mailarchive/forum.php?thread_name=47FBD95C.2080906%40cfa.harvard.edu&forum_name=rsnapshot-discuss

Ah.
That thread pretty much mirrors what I originally was thinking of, namely using this pivoting --link-dest thing, but then I got hung up on how to also specify the multiple-hosts part of it (not just the multiple-dates part) so the interhost hardlinks wouldn't get snapped. (It also mirrors some other things I thought about, and even tried, such as using dump/restore because this is an ext3fs; unfortunately, even they would run out of physical memory and swapping gets too extreme [and address space, unless run on 64bit], and, worse, restore -nailed- the CPU---who ever thought that restore would be -CPU- bound? Apparently it's got a suboptimal algorithm when handling an extensively hardlinked filesystem and no one ever thought about that or accorded it development effort---the maintainer of ext3/ext4 is a friend of mine and we've been having an entertaining conversation about such filesystems, fsck, and dump/restore across the last month... :)

> You could use it as a starting point for a more sophisticated script for
> your scenario. Just loop through every (host, date) pair and run rsync,
> passing as many --link-dest options as you need to help rsync discover
> all the inter-host and inter-date links.
> My inclination would be to make the dates the outer loop and the hosts
> the inner loop since you have only six hosts but presumably many more
> dates. Then, for each hostX/DATE dir, I would --link-dest to each
> host's most recent existing dir on the destination. E.g., if you go
> through the hosts in alphabetical order, hostC/20080102 would
> --link-dest to hostA/20080102 and hostB/20080102 (already copied for
> this date) as well as hostC/20080101, hostD/20080101, hostE/20080101,
> and hostF/20080101 (hosts not yet copied for 20080102). You could try
> this and adjust as you see fit.

Hmm. I'll try this out on a small test case and see if it works the way this seems to indicate. Can I rely on rsync -not- doing a complete directory scan of the --link-dest's?
E.g., if hostC/20080102 never mentions dir a/b/c, rsync won't bother investigating a/b/c on any of the link-dest's? I would -assume- this is how it works, but I'm checking because one of the hosts has a gazillion files and a complete directory scan of it for one date's backup takes something like half an hour; I'd prefer if this only happened when -that- host's files are the ones actively being copied, as opposed to when the smaller hosts' files are being copied but mention the big host's files in a --link-dest---if rsync only scans those directories that are in common across that pair of hosts, instead of preemptively scanning the entire --link-dest dir up-front, it would save an enormous amount of time.

[Unfortunately, the pivoting strategy I was thinking of, and which rsnapshot-copy implements, still wastes a lot of time redundantly rescanning the former target when it becomes the new --link-dest [the old rsync -knew- what the new one must rederive from the FS]; in my case, it approaches an hour per date when talking about all six hosts, so that's 60-100 hours right there. But unless rsync could transfer the old brain into the new brain, or unless it all fit into the filesystem cache, I don't see a reasonable way around this problem.]

> Rsync does a linear search through the basis dirs, so you should put the
> most likely ones first, e.g., hostC/20080101 in my example. In deciding
> how many dirs to pass, consider the benefit of the extra dirs versus the
> time that rsync wastes checking those dirs for files that are genuinely
> new in the current hostX/DATE dir.

Right.

> > I -think- I might be able to finesse this by actually physically
> > rearranging the directories on the source---risky given that fsck
> > is complaining about it, but maybe...
> > the idea would be to invert
> > the organization so that every host is under a date (e.g., instead
> > of hostA/date1, hostA/date2, etc, I make it date1/hostA, date1/hostB),
> > and then I can specify a SINGLE dir (namely "date1") and not use -R.
> > [I can't just specify a single HOST in the current arrangement because
> > there are far more dates than hosts and that causes a huge directory
> > scan that runs rsync out of memory.]
> That would work. To avoid physically rearranging the source, you could
> create a structure of symlinks on another filesystem and point rsync to
> that. The downside is that you have to use -H to catch cross-host hard
> links.

I thought about symlinks but didn't want rsync to copy those as well, though if I'd been smarter about it I'd have realized that I could trivially delete them from the destination when everything finished (and I may still try it, or something like it, if the solution you advanced above doesn't work). If a symlink tree (or physical rearrangement) means that rsnapshot-copy is good enough, it's probably faster than my trying to completely test your fancier nested-loop implementation with all the individual --link-dest arguments.

Your last sentence above seems to imply that, if I use your solution of multiple --link-dest's (but iterating over every host/date pair as source directories anyway), I -wouldn't- have to use -H to preserve all those hardlinks. I was assuming that I'd need -H regardless of whether --link-dest was in use if I wanted the hardlinks preserved; which is correct? (And would omitting -H if it's unnecessary save me any memory or runtime, or would the use of those --link-dest's make that moot? Would the --link-dest's run incrementally and -not- require the "keeping everything in RAM" drawback that using -H imposes?) [Or are you saying "-H with --link-dest is redundant -unless- there are intra-host hardlinks"? In which case, I need -H anyway.]

P.S.
Probably this should be its own message, but since we're talking --link-dest issues: While I was investigating my possible filesystem corruption issue, I ran rsync with -c (and -H, and all the other usual switches that dirvish et al would use) between some sources & targets to see if any files had been mangled. I was surprised to see (via "lsof | grep rsync"; nothing more sophisticated) that rsync appeared to be inefficient: if the source had a/1, b/2, and c/3 all hardlinked together, rsync appeared to read -all 3- files to compute their checksums, even though they couldn't possibly be different since they were really all the same inodes; I noticed since some of them were quite large and it caused an even larger performance hit than I was expecting from -c. This was an old version (2.6.7 :) and I haven't verified with the latest, but I figured I'd mention it in case you wanted to whip up a quick test case and strace it, or knew it had already been fixed.

Thanks again.
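
For concreteness, here is a rough sketch of the nested-loop ordering discussed above (dates outer, hosts inner, most likely basis dir first). It only builds the rsync argument lists; it does not run anything. The function names, the /old/vault and /new/vault paths, and the -aH option set are illustrative assumptions, not anything taken from rsnapshot-copy or dirvish. It also assumes that every host has a directory for every date; with gaps you would have to look up each host's most recent dir that actually exists on the destination.

```python
def link_dest_dirs(hosts, dates, host, date):
    """Return basis dirs (relative to the destination root) for host/date.

    Hypothetical helper. Assumes hosts and dates are sorted lists and that
    every host has a directory for every date.
    """
    h = hosts.index(host)
    d = dates.index(date)
    prev = dates[d - 1] if d > 0 else None

    dirs = []
    # Same host, previous date: the most likely match, so it goes first
    # (rsync searches the basis dirs linearly).
    if prev is not None:
        dirs.append(f"{host}/{prev}")
    # Hosts already copied for this date.
    dirs.extend(f"{other}/{date}" for other in hosts[:h])
    # Hosts not yet copied for this date: fall back to the previous date.
    if prev is not None:
        dirs.extend(f"{other}/{prev}" for other in hosts[h + 1:])
    return dirs


def rsync_command(src_root, dst_root, hosts, dates, host, date):
    """Build the argv for one hostX/DATE run (nothing is executed here)."""
    # -H is included on the assumption that intra-host hard links must be
    # preserved; whether it can be dropped is the open question above.
    cmd = ["rsync", "-aH"]
    for basis in link_dest_dirs(hosts, dates, host, date):
        # Absolute paths, so they are not interpreted relative to the
        # destination dir and no tilde expansion is needed.
        cmd.append(f"--link-dest={dst_root}/{basis}")
    # The parent dir dst_root/host is assumed to exist already.
    cmd.append(f"{src_root}/{host}/{date}/")
    cmd.append(f"{dst_root}/{host}/{date}/")
    return cmd


if __name__ == "__main__":
    hosts = ["hostA", "hostB", "hostC", "hostD", "hostE", "hostF"]
    dates = ["20080101", "20080102"]
    for date in dates:          # outer loop: dates
        for host in hosts:      # inner loop: hosts
            print(" ".join(rsync_command("/old/vault", "/new/vault",
                                         hosts, dates, host, date)))
```

With six hosts this passes at most six --link-dest options per run, well under rsync's limit on the number of basis dirs. Whether each extra basis dir costs a full scan or only lookups for the files actually being transferred is exactly what I'd want to check on a small test case first.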