Les Mikesell wrote at about 07:59:36 -0600 on Wednesday, January 14, 2009:
> Johan Ehnberg wrote:
> >
> >>> OK. I can see now why this is true. But it seems like one could
> >>>> rewrite the backuppc rsync protocol to check the pool for a file with
> >>>> same checksum before syncing. This could give some real speedup on
> >>>> long files. This would be possible at least for the cpool where the
> >>>> rsync checksums (and full file checksums) are stored at the end of
> >>>> each file.
> >>> Now this would be quite the feature - and it fits perfectly with the idea
> >>> of smart pooling that BackupPC has. The effects are rather interesting:
> >>>
> >>> - Different incremental levels won't be needed to preserve bandwidth
> >>> - Full backups will indirectly use earlier incrementals as reference
> >>>
> >>> Definite wishlist item.
> >> But you'll have to read through millions of files and the common case of
> >> a growing logfile isn't going to find a match anyway. The only way this
> >> could work is if the remote rsync could send a starting hash matching
> >> the one used to construct the pool filenames - and then you still have
> >> to deal with the odds of collisions.
> >>
> >
> > Sure you are pointing to something and are right. What I don't see is
> > why we'd have to do an (extra?) read through millions of files?
>
> You are asking to find an unknown file among millions using a checksum
> that is stored at the end. How else would you find it? The normal test
> for a match uses the hashed filename to quickly eliminate the
> possibilities that aren't hash collisions - this only requires reading a
> few of the directories, not each file's contents, and is something the OS
> can do quickly.
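Right, and it's worth spelling out just how specialized that lookup key is. As I understand the 3.x code (BackupPC::Lib::File2MD5), the pool name is a partial-file md5 seeded with the file length plus the first and last chunks of the data -- roughly like the sketch below. This is a reconstruction from memory, so the exact chunk sizes and offsets may be off; treat it as illustrative, not authoritative:

use strict;
use warnings;
use Digest::MD5;

# Sketch of the partial-file digest BackupPC derives pool names from
# (reconstructed from memory of 3.x's File2MD5 -- details may differ).
sub pool_digest {
    my ($path) = @_;
    my $size = -s $path
        or return;                     # empty or missing files never hit the pool
    open my $fh, '<', $path or return;
    binmode $fh;

    my $md5 = Digest::MD5->new;
    $md5->add($size);                  # the file length seeds the digest
    my $data;
    if ($size > 262144) {
        # first 128KB of data ...
        $md5->add($data) if sysread($fh, $data, 131072);
        # ... plus a 128KB chunk ending at min(size, 1MB)
        my $seek = ($size > 1048576 ? 1048576 : $size) - 131072;
        $md5->add($data)
            if sysseek($fh, $seek, 0) && sysread($fh, $data, 131072);
    }
    else {
        # small files: digest the whole contents
        $md5->add($data) if sysread($fh, $data, $size);
    }
    close $fh;
    return $md5->hexdigest;            # mapped to e.g. cpool/d/e/a/<digest> on disk
}

print pool_digest($ARGV[0]), "\n" if @ARGV;

Nothing a stock rsync sends over the wire corresponds to that digest, so there is no cheap way to turn "here is a file on the client" into "here is the matching pool entry".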
That's why I mentioned in my previous post that having a relational database structure would be very helpful here, since the current hard-link-based storage approach allows for only a single way of efficiently retrieving pool files (other than by their backup path), and that method depends on a non-standard partial-file md5sum. A relational database would allow pool files to be found based upon any number of attributes or md5sum-type labels (see the sketch at the end of this message).

>
> > That is
> > done with every full anyway,
>
> No, nothing ever searches the contents of the pool. Fulls compare
> against the previously known matching files from that client.
>
> > and in the case of an incremental it would
> > only be necessary for new/changed files. It would in fact also speed up
> > those logs because of rotation: an old log changes name but is still
> > found on the server.
>
> On the first rotation that would only be true if the log hadn't grown
> since the moment of the last backup. You'd need file chunking to take
> advantage of partial matches. After that, a rotation scheme that
> attached a timestamp to the filename would make more sense.
>
> > I suspect there is no problem in getting the hash with some tuning to
> > Rsync::Perl? It's just a command as long as the protocol allows it.
>
> There are two problems. One is that you have a stock rsync at the other
> end and at least for the protocols that Rsync::Perl understands there is
> not a full hash of the file sent first. The other is that even if it
> did, it would have to be computed exactly in the same way that backuppc
> does for the pool filenames or you'll spend hours looking up each
> match.

Are you sure that you can't get rsync to calculate the checksums (both
block and full-file) before the file transfer begins? I don't know -- I'm
just asking.

>
> > Collisions aren't exactly a performance problem, are they? BackupPC handles
> > them nicely from what I've seen.
>
> But it must have access to the contents of the file in question to
> handle them. It might be possible to do that with an rsync block
> compare across the contents but you'd have to repeat it over each hash
> match to determine which, if any, have the matching content. It might
> not be completely impossible to do remotely, but it would take a well
> designed client-server protocol to match up unknown files.
>
> --
> Les Mikesell
> [email protected]
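Coming back to the relational-database point above, here is roughly the kind of pool index I have in mind. This is a loose sketch only, using SQLite via DBI; the database, table and column names are invented for illustration and nothing like this exists in BackupPC today:

use strict;
use warnings;
use DBI;

# Hypothetical pool index -- schema and names made up for illustration.
my $dbh = DBI->connect("dbi:SQLite:dbname=poolindex.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS pool_file (
    pool_path    TEXT PRIMARY KEY,  -- e.g. the cpool/d/e/a/... file
    partial_md5  TEXT NOT NULL,     -- BackupPC's current partial-file pool hash
    full_md5     TEXT,              -- digest of the complete contents
    size         INTEGER NOT NULL
)
SQL

$dbh->do("CREATE INDEX IF NOT EXISTS idx_full_md5
              ON pool_file (full_md5, size)");

# With such an index, "does the pool already hold a file with this
# whole-file checksum?" is one indexed query instead of a scan that
# has to read file contents off disk.
sub find_by_full_md5 {
    my ($full_md5, $size) = @_;
    return $dbh->selectrow_array(
        "SELECT pool_path FROM pool_file WHERE full_md5 = ? AND size = ?",
        undef, $full_md5, $size,
    );
}

A transfer method could then ask "does the pool already hold a file with this whole-file checksum and size?" with a single indexed query before sending any data, and the same table could carry whatever other labels you wanted to search by -- which is exactly the kind of lookup the hard-link pool cannot offer.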
