Hi,

(This thread was originally in -users.)
On 2012-06-15 07:33:31+0100, Tyler J. Wagner wrote:
> On 2012-06-14 23:41, Steve wrote:
> > My novice understanding is that the next release is a "whopper" which
> > fundamentally changes many things about how BackupPC works... so it's
> > taking a long time.
>
> I'm sorry to have to cast doubt on this. I have heard the above many times
> on this list, but BackupPC development is opaque and centralised in the
> main developer. A lot of us write patches, and the Debian packagers seem
> to be helping keep BackupPC up to date. But getting those patches upstream
> seems impossible, and I'm beginning to wonder if v4 is coming at all.

I'm currently experimenting with some ideas of my own. My employer uses
BackupPC to back up ~150 unix servers nightly. To work around some
performance problems I had to alter the source code a little, and while
doing so I realized there might be some nice gains to be made by doing
certain things differently.

The ideas overlap to a limited extent with the ideas[0] that Craig posted
to this list: for instance, no more hardlinks, and garbage collection done
using flat-file databases. Some things are quite different, though. I'll
try to explain my ideas here.

Observations
============

1) I/O: writes are fast, reads are slow, and random reads are very slow.
   Writing can be done asynchronously, and even random writes can be
   elevatored to the platters fairly efficiently. Reading, on the other
   hand, blocks the process doing it. Running many processes in parallel
   can alleviate the random-read bottleneck a little, but it's still best
   to do sequential reads whenever possible.

2) CPU and RAM are cheap. BackupPC parallelizes well, and for larger
   installations 12-core or even 24-core systems can be had for reasonable
   prices. Memory is dirt cheap.

File data
=========

All files are split into chunks of (currently) 2 megabytes. These chunks
are addressed by their sha512 hashes; that is, the filename for each chunk
is simply the base64-encoded hash. Any compression is applied after
splitting. This provides dedup even for large files that change or
otherwise differ only slightly. The use of sha512 makes collisions a
non-issue in practice, giving safe dedup without expensive whole-file
comparisons, which obviates the need for a lot of disk reading. (A rough
sketch of the chunking step is appended after the Garbage collection
section below.)

Pool metadata
=============

Each backup consists of a number of databases, one for each share. Each
database contains a complete list of files, and for each file the file
attributes plus the list of sha512 hashes that together describe its
contents. These databases are of the ‘write once, read many times’
variety, similar to djb's cdb databases or iso9660 filesystems. This part
is already implemented:

  https://git.fruit.je/hardhat
  https://git.fruit.je/hardhat-perl

Some informal estimates based on real-world data indicate that files take
up about 100 bytes of metadata on average in this format.

Garbage collection
==================

The list of used hashes is collected from each backup by scanning the
databases for each share. These lists are combined into one list per host,
and the per-host lists are combined again into a global list. The actual
garbage collection then simply enumerates all files in the chunk pool and
checks each chunk against the global list to see whether it is still
needed. This part is already implemented:

  https://git.fruit.je/hashlookup-perl

The hash lists need to be sorted and merged so that lookups are efficient.
That part is implemented as well, but doesn't yet have a git repository of
its own.
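
To make the chunking idea concrete, here is a rough sketch (not the actual
implementation) of splitting a file into 2 MiB chunks and pooling each
chunk under the base64-encoded sha512 of its contents. The flat pool
layout, the store_file() name and the '+/' -> '-_' substitution to keep
the names filesystem-safe are assumptions made for the example, and
compression is left out entirely:

  #!/usr/bin/perl
  # Illustrative sketch: chunk a file and pool each chunk by content hash.
  use strict;
  use warnings;
  use Digest::SHA qw(sha512_base64);

  my $CHUNK_SIZE = 2 * 1024 * 1024;    # 2 MiB chunks, as described above

  sub store_file {
      my ($path, $pooldir) = @_;
      open my $in, '<:raw', $path or die "open $path: $!";
      my @hashes;
      while (read($in, my $chunk, $CHUNK_SIZE)) {
          my $hash = sha512_base64($chunk);
          $hash =~ tr{+/}{-_};           # filesystem-safe chunk name
          push @hashes, $hash;
          my $file = "$pooldir/$hash";
          next if -e $file;              # identical chunk already in the pool
          open my $out, '>:raw', "$file.tmp" or die "open $file.tmp: $!";
          print {$out} $chunk or die "write $file.tmp: $!";
          close $out or die "close $file.tmp: $!";
          rename "$file.tmp", $file or die "rename $file: $!";
      }
      close $in;
      return \@hashes;                   # ordered chunk list describing the file
  }

The garbage-collection pass can then be as simple as walking the chunk
pool and deleting every chunk that no longer appears in the merged global
list. The sketch below cheats by keeping the lookup structure in a plain
in-memory hash; the whole point of the sorted flat lists is of course to
avoid exactly that, so treat this as an illustration of the logic only:

  # Illustrative sketch: drop pool chunks that no backup references anymore.
  use strict;
  use warnings;
  use File::Find;

  sub collect_garbage {
      my ($pooldir, $used) = @_;    # $used: hashref keyed by in-use chunk hashes
      find(sub {
          return unless -f $_;
          return if exists $used->{$_};    # still referenced by some backup
          unlink $_ or warn "unlink $File::Find::name: $!";
      }, $pooldir);
  }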

Incremental backups
===================

Incremental backups use a previous backup as a reference. Any file that
has the same timestamp and size as in the reference is not transferred
from the client; instead its metadata is simply copied from the previous
backup. This means that incremental backups are complete representations
of the state of the client: the reference backup is not needed to browse
or restore files. (A tiny sketch of this per-file decision is appended at
the very end of this mail.)

Integration
===========

The above ideas and code need to be integrated into BackupPC. I've created
a git repository for that:

  https://git.fruit.je/backuppc

but the code in that repository is pretty much guaranteed not to work at
all, for now. My next target is to create BackupPC::Backup::Reader and
BackupPC::Backup::Writer classes, similar to the poolreader and poolwriter
classes. The reader class might even get a variant that can read v3
backups, which would be useful for migration scenarios.

===

So, where to go from here? I'd love to hear what Craig thinks of all this.
As far as I'm aware he has not started work on 4.0 yet, so I'll take the
liberty of continuing to tinker with the above ideas in the meantime. :)
And of course, comments and feedback are more than welcome.

Kind regards,

-- 
Wessel Dankers <w...@fruit.je>

[0] http://sourceforge.net/mailarchive/message.php?msg_id=27140174
    http://sourceforge.net/mailarchive/message.php?msg_id=27140175
    http://sourceforge.net/mailarchive/message.php?msg_id=27140176
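
PS: to make the incremental rule above concrete, here is a minimal sketch
of the per-file decision, assuming $ref holds the metadata record for the
same path in the reference backup (or undef if the path is new there). The
field names (size, mtime, chunks) and the function name are invented for
the example and don't correspond to any existing code:

  # Illustrative sketch: reuse metadata when timestamp+size match the reference.
  use strict;
  use warnings;

  sub chunks_from_reference {
      my ($size, $mtime, $ref) = @_;
      if ($ref and $ref->{size} == $size and $ref->{mtime} == $mtime) {
          # Unchanged since the reference backup: copy the chunk list (and,
          # in the real thing, the rest of the attributes) straight across,
          # so this backup still describes the file completely on its own.
          return $ref->{chunks};
      }
      return undef;    # changed or new: transfer from the client and chunk it
  }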