Hi,

martin f krafft wrote on 2011-04-17 16:43:07 +0200 [Re: [BackupPC-users] Renaming files causes retransfer?]:
> also sprach John Rouillard <rouilj-backu...@renesys.com> [2011.04.17.1625 +0200]:
> > > In terms of backuppc, this means that the files will have to be
> > > transferred again, completely, right?
> >
> > Correct.
>
> Actually, I just did a test, using iptables to count bytes between
> the two hosts, and then renamed a 33M file. backuppc, using rsync,
> only transferred 370k. Hence I think that it actually does *not*
> transfer the whole file.
it always feels strange to contradict reality, but, in theory, there is no way to get around transferring the file. For the rsync algorithm to work, you need a local reference copy of the file you want to transfer. While you and I know that there *is* a local copy, BackupPC would need to know (a) that there is one and (b) where to find it. The only information available at the point where this decision needs to be made is the (new) file name, and for that name there is no candidate in the reference backup (or any other backup, for that matter). So the file needs to be transferred in full.

We'd all like to be able to choose an existing *pool file* as the reference - that would save us the transfer of *any* file already in the pool (e.g. from other hosts). Unfortunately, this is technically not possible without a specialized BackupPC client.

> (btw, I also think that what I wrote in
> http://comments.gmane.org/gmane.comp.sysutils.backup.backuppc.general/24352
> is wrong, but I shall follow up on this when I have verified my
> findings).

Is that a backuppc-users thread I somehow missed?

I see where your question is going now, so I'll go into a bit more detail (not sure if any of this was already mentioned in that thread).

1.) BackupPC uses already existing transfer methods so that nothing non-mainstream needs to be installed on the clients. In your case, that is probably ssh + rsync. Consequently, BackupPC is limited to what the rsync protocol allows, which does *not* include "hey, send me the 1st and 8th 128kB chunk of the file before I'll tell you the checksum I have on my side". Such a request just doesn't make any sense for standalone rsync. We need to select a candidate before we can start transferring blocks that don't match (and skipping blocks that do). It's really quite obvious if you think about it, and it only gets more complicated (but doesn't change) if you go into the details of which rsync end plays which role in the file delta exchange.
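To illustrate why the reference copy matters (this is only a sketch, *not* rsync's real implementation - real rsync uses a rolling weak checksum plus a strong checksum and can match blocks at arbitrary offsets; `block_sums`, `delta_size` and the block size are all made up here):

```python
import hashlib

BLOCK = 128 * 1024  # illustrative block size, not rsync's actual default

def block_sums(data):
    """Checksums of each fixed block of the reference copy we hold locally."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(data), BLOCK)}

def delta_size(new_data, reference):
    """Rough count of literal bytes the sender would have to transmit."""
    if reference is None:
        return len(new_data)            # no local reference -> full transfer
    known = block_sums(reference)
    literal = 0
    for i in range(0, len(new_data), BLOCK):
        blk = new_data[i:i + BLOCK]
        if hashlib.md5(blk).hexdigest() not in known:
            literal += len(blk)         # unmatched block is sent verbatim
    return literal
```

With a reference copy, an unchanged file costs (almost) nothing; with `reference=None` - which is BackupPC's situation for a renamed file, since it only has the new name to look up - every byte goes over the wire.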
The same is basically true for tar and smb, respectively. The remote end decides what data to transfer (which is the whole file or nothing), and you can take it or ignore it, but you can't prevent it from being transferred.

2.) BackupPC reads the first 1MB of a file into memory. It needs to do so to determine the pool file name. That should not be a problem memory-wise.

3.) BackupPC cannot, obviously, read an arbitrarily large file into memory. It also wants to avoid unnecessary (possibly extremely large) writes to the pool FS. So it does this:

- Determine pool file candidates (possibly several, in case of pool collisions).
- Read the pool file candidates in parallel with the network transfer.
- As soon as something doesn't match, discard the respective candidate.
- If that was the last available candidate, copy everything so far (which *did* match) from that candidate to a new file. We need to get this content from somewhere, and the network stream is, obviously, not seekable, so we can't re-get it from there (but then, we don't need to and wouldn't want to, because, hopefully, our local disk is faster ;-).
- If the whole candidate file matched our complete network stream, we have a pool match and only need to link to it.

4.) There *was* an attempt to write a specialized BackupPC client (BackupPCd) quite a while back. I believe it was given up for lack of human resources. I always found the matter rather interesting, but I've never gotten around to even taking a look at the code, let alone doing anything with it.

I hope that clears things up a bit.

Regards,
Holger
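P.S.: the candidate matching in point 3.) could be sketched roughly like this. This is illustrative Python, not BackupPC's actual code - the function and parameter names are invented, and it assumes at least one candidate and plain uncompressed files:

```python
def match_stream(stream, candidate_paths, out_path, chunk_size=4096):
    """Match a non-seekable stream against pool file candidates.

    Returns the path of a candidate identical to the stream (a pool match),
    or None after writing the stream's content to out_path.
    Assumes candidate_paths is non-empty."""
    live = {p: open(p, "rb") for p in candidate_paths}
    matched = 0          # length of the prefix known to match so far
    last = None          # most recently discarded candidate path
    out = None           # output file, opened once no candidate is left
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for p in list(live):
            if live[p].read(len(chunk)) != chunk:
                live.pop(p).close()     # mismatch: discard this candidate
                last = p
        if not live and out is None:
            # Last candidate discarded: start a new file.  The prefix that
            # *did* match is re-read from local disk, because the network
            # stream cannot be rewound.
            out = open(out_path, "wb")
            with open(last, "rb") as ref:
                out.write(ref.read(matched))
        if out is not None:
            out.write(chunk)
        matched += len(chunk)
    if out is not None:
        out.close()
        for f in live.values():
            f.close()
        return None
    # Some candidate matched every byte of the stream; it is a pool match
    # only if it also ends exactly where the stream ends.
    winner = next((p for p, f in live.items() if not f.read(1)), None)
    if winner is None:
        # All survivors are longer than the stream: the stream equals their
        # common prefix, so copy that prefix into a new file.
        p = next(iter(live))
        live[p].seek(0)
        with open(out_path, "wb") as out:
            out.write(live[p].read(matched))
    for f in live.values():
        f.close()
    return winner
```

The point the sketch makes is the one above: once the last candidate fails, the already-matched prefix has to come from local disk, and until then no pool write happens at all.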
_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/