[Replying to multiple messages]

also sprach Jeffrey J. Kosowsky <backu...@kosowsky.org> [2010.08.29.0358 +0200]:
> You are ignoring pool collisions which are very real and not
> altogether infrequent. For example a constant length log or database
> file could easily have the same 1st and 8th 128k block but still be
> different.

Good example. So let's forget about this case and concentrate on my
optimisation suggestion:

> > 2. Assuming that the two 128k block checksums and the file size are
> >    not collision-free (they probably aren't), backuppc should really
> >    uncompress the pool file and employ rsync's rolling checksum to
> >    update the file (in memory). If there were any changes, then it
> >    should write out the NewFile to disk; in the absence of changes,
> >    it should create the hardlink.
>
> While I don't understand all the details of rsync checksums, you seem
> to be missing the fact that when using rsync on the cpool, the actual
> block and full-file rsync checksums are appended to the end of the
> cpool file. Therefore, it is not necessary to always uncompress the
> file but rather it is sufficient just to read out the stored checksums
> (though with checksum caching you can choose to have a predetermined
> fraction of the files checked each time). Note I may not be describing
> this totally accurately but hopefully you get the point.

Yeah, I get the point, but the bottleneck in my case is not the
decompression, but the fact that the peer must send the entire file
over a slow link, even though it's already present on the BackupPC
server.

also sprach Les Mikesell <lesmikes...@gmail.com> [2010.08.29.0102 +0200]:
> On 8/28/10 3:22 PM, martin f krafft wrote:
> > also sprach Les Mikesell<lesmikes...@gmail.com> [2010.08.28.2151 +0200]:
> >> If it is one or a few files or constrained to a directory that
> >> you know you already have backed up locally, why not just
> >> exclude it on the remote machines?
> >
> > It happens regularly.
>
> But if it is under your control, you might arrange it to be under
> an excluded directory.

I am dealing with u.s.e.r.s. That stands for: "unpredictable
sometimes emotionally regressive species". So no.
;)

> > Don't you think BackupPC could be optimised *iff* rsyncp could
> > ask the peer mid-transfer to calculate the whole file checksum
> > (it could just ask that anyway, but that would increase the
> > client load)?
>
> You are working with a stock rsync on the other end, so I don't
> think that's an option - and rsync's checksums aren't the same as
> the hash used to build the pool filenames.

Okay, let's assume for a moment that we cannot ask the peer to
calculate the hashsum mid-transfer. What else could we do?

To recap: I would like to avoid having to transfer an entire file if
chances are high that it's already in the pool.

What I think BackupPC is doing right now is:

  1. It starts receiving a file.

  2. After a certain time, it has enough information to take a guess
     at the corresponding pool file, and opens it.

  3. What seems to happen now is weird: the FileIO method
     fileDeltaRxNext is called repeatedly, but at the same time, the
     client keeps sending data, not checksums.

See the following demonstration:

I created a backup host with just a TESTFILE, 1.5 MB of hex
"aabbccddeeff…", and backed it up. I then copied that file to NEWFILE
and ran another full backup.

  391f184ac1937f245a19652816d10d0e  NEWFILE
  391f184ac1937f245a19652816d10d0e  TESTFILE

The following is the strace output of the second run, grepped like
this:

  egrep 'log (Receiving: |tmp/backuppc-test/NEWFILE)|cpool'

and interspersed with my comments:

3932 write(8, "log tmp/backuppc-test/NEWFILE: s"..., 65 <unfinished ...>
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: s"..., 65536) = 65
# The first bytes are arriving, I don't know what 0xfc0f0007 is, but
# it appears all over the place.
3932 write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
3928 <... read resumed> "log Receiving: fc0f0007aabbccdde"..., 65536) = 8192
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: o"..., 126 <unfinished ...>
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60 <unfinished ...>
3932 write(8, "log Receiving: fc0f0007ccddeeffa"..., 4096 <unfinished ...>
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 8457
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79 <unfinished ...>
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60 <unfinished ...>
3932 write(8, "log Receiving: fc0f0007aabbccdde"..., 4096 <unfinished ...>
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 4374
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log Receiving: fc0f0007eeffaabbc"..., 4096 <unfinished ...>
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 12427
[…]
# A few dozen equivalent lines later:
3932 write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79 <unfinished ...>
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 65536
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
# Oh look, we found a possible match in the pool:
3932 stat("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b", {st_mode=S_IFREG|0600, st_size=29415, ...}) = 0
3932 open("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b", O_RDONLY) = 6
3932 stat("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b_0", <unfinished ...>
# But we keep receiving data (note: "eeffaabbccddeeffaabbccddeeff"),
# not checksums:
3928 read(6, "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 139
3932 write(8, "log Receiving: fc0f0007eeffaabbc"..., 4096) = 4096
3928 <... read resumed> "log Receiving: fc0f0007eeffaabbc"..., 65536) = 8192
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log Receiving: fc0f0007ccddeeffa"..., 4096) = 4096
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 37142
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
# More data: "eeffaabbc"
3932 write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
3928 <... read resumed> "log tmp/backuppc-test/NEWFILE: b"..., 65536) = 65536
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60 <unfinished ...>
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
[…]
# Hundreds of lines later, even more data:
3932 write(8, "log Receiving: fc0f0007aabbccdde"..., 4096) = 4096
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log Receiving: fc0f0007eeffaabbc"..., 4096) = 4096
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log Receiving: fc0f0007ccddeeffa"..., 4096) = 4096
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 79) = 79
3932 write(8, "log tmp/backuppc-test/NEWFILE: w"..., 60) = 60
3932 write(8, "log tmp/backuppc-test/NEWFILE: b"..., 75) = 75
# And now, 2 MB have been transferred, so we can finally discard all
# the received data and hardlink instead.
3928 read(6, "log Receiving: fc0f0007ccddeeffa"..., 65536) = 65536
3932 write(8, "log tmp/backuppc-test/NEWFILE go"..., 111) = 111
3932 link("/var/lib/backuppc/cpool/1/b/5/1b56172076f0f811087ed07b4c7dda9b", "/srv/backuppc/pc/charade.madduck.net/new/f%2f/ftmp/fbackuppc-test/fNEWFILE") = 0

Do you see what I mean?

also sprach Jeffrey J. Kosowsky <backu...@kosowsky.org> [2010.08.29.0404 +0200]:
> But the rsync block and file md4 checksums (and yes it's md4 for
> the rsync <30 protocol required by perl-File-RsyncP) are appended
> to the end of each cpool file.

Here's what I think should happen instead:

As soon as at least one candidate file in the pool is found:

  1. Keep the received data (2×128k) in memory;

  2. Somehow convince the peer that we actually have the file and
     that it should continue sending block checksums, or however the
     rsync protocol actually works;

  3. Compute our own checksums and keep going while they match what
     the peer sends;

  4. If we reach the file's end, hardlink, and be DONE.

  5. If we receive a checksum different from what our pool file has
     (taken either from the checksum cache or computed over the
     uncompressed file), then it means that the peer's file is
     different from what we have in the pool. In this case, we can
     reconstitute the actual file up to this point from the data
     saved in step (1.), the blocks from the pool file with matching
     checksums, and what we have just received.

  6. Now we have to convince the peer to send real data again.

Does this make sense?

-- 
martin | http://madduck.net/ | http://two.sentenc.es/

"there are two major products that come out of berkeley: lsd and unix."
one caused me an addiction -- fyodor

spamtraps: madduck.bo...@madduck.net
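
P.S. To make the six steps above a bit more concrete, here is a rough Python sketch of the receiver-side logic. It is only a simulation under loud assumptions: it does not speak the rsync protocol, both "peer" and "pool" files are plain in-memory byte strings, MD5 stands in for the MD4 block checksums mentioned above, and every name in it (BLOCK, block_digests, receive_with_pool_candidate, head_blocks) is made up for illustration and is not part of BackupPC or File::RsyncP. Whether a stock rsync peer can be talked into steps 2 and 6 is exactly the open question.

```python
import hashlib

BLOCK = 128 * 1024  # illustrative block size, matching the 128k blocks above


def block_digests(data, block=BLOCK):
    """One digest per fixed-size block (MD5 as a stand-in for rsync's MD4)."""
    return [hashlib.md5(data[i:i + block]).digest()
            for i in range(0, len(data), block)]


def receive_with_pool_candidate(peer_file, pool_file, block=BLOCK, head_blocks=2):
    """Simulate steps 1-6 and return (action, literal_bytes, new_content).

    literal_bytes counts file data that had to cross the link as data
    rather than as checksums. Both files are in-memory bytes here; a
    real implementation would stream them.
    """
    # Step 1: the first blocks arrive as literal data and stay in memory.
    head = peer_file[:head_blocks * block]
    literal = len(head)
    if pool_file[:len(head)] != head:
        # No pool candidate: the whole file is transferred as usual.
        return "store", len(peer_file), peer_file

    # Step 2 (hypothetical): the peer now sends per-block checksums only.
    peer_sums = block_digests(peer_file, block)
    # Step 3: our own checksums, from the cache or the uncompressed pool file.
    pool_sums = block_digests(pool_file, block)

    for idx in range(head_blocks, len(peer_sums)):
        if idx >= len(pool_sums) or peer_sums[idx] != pool_sums[idx]:
            # Step 5: rebuild the prefix from the pool blocks that matched.
            prefix = pool_file[:idx * block]
            # Step 6: the peer goes back to sending real data.
            tail = peer_file[idx * block:]
            return "store", literal + len(tail), prefix + tail

    if len(peer_file) == len(pool_file):
        # Step 4: end of file reached with every block matching.
        return "hardlink", literal, None
    # Peer file is a strict prefix of the pool file: no further data needed.
    return "store", literal, peer_file
```

With an unchanged 12-block file, this returns "hardlink" after only the first two blocks have crossed the link as data; with a change in the ninth block, it falls back to "store" and counts the two head blocks plus everything from the changed block onwards.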
_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/