Hi Bill,

On 2012-08-07 15:32:47 -0700, Bill Broadley wrote:
> To this end I've been writing some code to:
> * Keep a client side sqlite database for metadata and file -> blob
>   checksum mappings.
> * Use all available client side CPU cores (even my $200 tablet has 4)
>   for compression, encryption, and checksums.  Keeping up with a GigE
>   network connection is non-trivial.
> * Implement a network protocol (using protobufs) for a dedupe-enabled
>   upload to the server.
> * Protocol allows for "I have these new encrypted blobs, server; which
>   ones don't you have?".
>
> The above code is written, benchmarked, tested, but certainly not
> production ready.
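If I'm reading the protocol part right, the core of that exchange would be
roughly the following (only a sketch in Python rather than your protobuf
code; missing_blobs, store_blob and the 2 MiB chunk size are names and
numbers I've made up for illustration):

import hashlib

CHUNK_SIZE = 2 * 1024 * 1024  # illustrative chunk size


def chunk_hashes(path):
    """Yield (sha512 digest, chunk bytes) for each fixed-size chunk of a file."""
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield hashlib.sha512(chunk).digest(), chunk


def upload_file(server, path):
    """Send only the chunks the server does not already have."""
    chunks = list(chunk_hashes(path))
    digests = [digest for digest, _ in chunks]
    # "I have these new blobs; which ones don't you have?"
    missing = set(server.missing_blobs(digests))   # hypothetical server call
    for digest, data in chunks:
        if digest in missing:
            # compression/encryption would be applied before this point
            server.store_blob(digest, data)        # hypothetical server call
    return digests  # the file -> blob mapping to record client side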
This could be very interesting as an extra protocol next to rsync/smb/etc.

> I cringe at the thought of a flat-file database.  Seems little reason
> not to use sqlite or similar, then let the folks that prefer mysql,
> postgres, or the sql flavor of the day implement whatever is needed.

I'm starting to regret ever using the word 'database'. What I've written
is simply something that outputs a list of files in a backup, with an
index for quick lookups. It is written once and is read-only after that.
It's not a database in the sense that sqlite or mysql are.

> > 1) I/O: writes are fast, reads are slow.  Random reads are very slow.
>
> Not sure in what context you are discussing this.

The context is a server that's trying to back up a large number of
clients.

> It's not an issue client side (IMO), and server side you mostly want
> fast metadata reads, and fast blob writes.  After all I'd expect 100s
> of backups (writes to the pool) for every read (a restore).

Server side you just want to spool data to disk, ideally without reading
anything (as reads are what block).

> Ideally I could have N disks and put 1/N of the sha512 blobs on each.

By ditching hardlinks, that's exactly what you'll be able to do.

> > File data
> > =========
> >
> > All files are to be split into chunks of (currently) 2 megabytes.
> > These chunks are addressed by their sha512 hashes (that is, the
> > filename for each chunk is simply the base64 encoded hash).  Any
> > compression is applied after splitting.  This provides dedup even for
> > large files that change or otherwise differ only slightly.
>
> Seems low, why not 10-100 times that?  More sequential I/O is better,
> right?  In most cases I've seen where block level dedupe would be a big
> win they were doing something unsafe in the first place, like backing
> up a large live database (MySQL, ZODB, or Exchange, say).

I wanted the blocks to fit in RAM easily; that simplifies the code a lot.
But you may be right, perhaps the chunks could be a lot bigger.

> I don't follow how sha512/CAS avoids the need for a lot of disk reading.

If you use a small hash (a CRC, say) you need to compare the actual file
contents before you can conclude a match isn't a collision, and that
means a lot of disk reading. With sha512 you can forgo that comparison.

> So say a 1GB file on the client changes.  With file level dedupe the
> client basically has to say:
> "Hey server, do you have checksum foo?"
>
> The server then does a single flat file or sql lookup and says yes or
> no.
>
> With 2MB blocks you have to do the above 512 times, right?

There's no reason why we can't store a per-file hash in the metadata as
well. But with the smaller chunks we can pinpoint the changed part
without the server having to read anything but a few kilobytes of
metadata.

cheers,

--
Wessel Dankers <w...@fruit.je>
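PS: to make the blob addressing concrete, this is roughly what I have in
mind for storing chunks under their base64-encoded sha512 and spreading
them over N disks; paths and names are made up and it's only a sketch,
not the actual code:

import base64
import hashlib
import os

POOL_DISKS = ['/pool0', '/pool1', '/pool2', '/pool3']  # N independent disks (made-up paths)


def blob_path(data):
    """Return the pool path for a chunk, addressed by its sha512 hash."""
    digest = hashlib.sha512(data).digest()
    # URL-safe base64 so the name contains no '/' characters.
    name = base64.urlsafe_b64encode(digest).decode('ascii').rstrip('=')
    # The hash is uniformly distributed, so one byte of it picks a disk
    # and gives each of the N disks roughly 1/N of the blobs.
    disk = POOL_DISKS[digest[0] % len(POOL_DISKS)]
    return os.path.join(disk, name)


def store_chunk(data):
    """Write a chunk only if it isn't already in the pool (content-addressed dedup)."""
    path = blob_path(data)
    if not os.path.exists(path):
        with open(path, 'wb') as f:
            f.write(data)
    return path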