Hi there, Jeremy:

I'm adding Cc: the p2p-hackers list because I know some of the p2p hackers would be interested in this experiment.

On Aug 12, 2008, at 20:19 PM, Jeremy Fitzhardinge wrote:

(Brian Warner wrote:)
>> in some quick tests on allmydata customer data
>> we found the space savings to be less than 1%. You might want to do some
>> tests first (hash all your files, have your friends do the same, measure the
>> overlap) before worrying about sharing convergence secrets.
>
> Yes, that would be an interesting experiment to perform anyway.

This prompted me to update my dupfilefind tool to v1.3.0 [1]. To install it, run "easy_install dupfilefind". (If you don't have easy_install installed, follow these installation instructions: [2].)

Run dupfilefind with the --profiles option and point it at some directories. It will run for a long time (overnight?) and eventually print to stdout a list of the first 16 bits of the md5sum of each file and the filesize of each file, rounded up to 4096 bits. We can then compare our lists of 16-bit-md5s and filesizes to find out approximately how much data we could save by convergent encryption with one another.

Also, dupfilefind is a handy tool for finding identical copies of files on your system. :-)

The main difference between dupfilefind 1.3.0 and earlier versions is that now it uses a temporary file instead of RAM for its working state, which means it will now (eventually) finish no matter how many files you point it at. Earlier versions of dupfilefind would sometimes use up all your RAM and then fail.

Regards,

Zooko

[1] http://allmydata.org/trac/dupfilefind
[2] http://pypi.python.org/pypi/setuptools/0.6c8#installation-instructions
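
For anyone who wants to see the shape of the comparison step, here is a minimal sketch in Python. It is not dupfilefind's code and does not parse dupfilefind's output format; the function names (profile_file, profile_tree, estimated_shared_bytes) are made up for illustration, and it assumes sizes are rounded up to a multiple of 4096 bytes. Because only 16 bits of each md5 are kept, unrelated files can land in the same bucket, so the result is a rough upper estimate rather than an exact figure.

```python
import hashlib
import os

BLOCK = 4096  # assumed rounding granularity for file sizes, in bytes


def profile_file(path):
    """Return (first 16 bits of the md5 as 4 hex chars, size rounded up to BLOCK)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files do not need to fit in RAM.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    size = os.path.getsize(path)
    rounded = -(-size // BLOCK) * BLOCK  # ceiling to a multiple of BLOCK
    return h.hexdigest()[:4], rounded


def profile_tree(root):
    """Return the set of (md5 prefix, rounded size) pairs for regular files under root."""
    profile = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                profile.add(profile_file(path))
    return profile


def estimated_shared_bytes(mine, theirs):
    """Sum the rounded sizes of buckets that appear in both profiles."""
    return sum(size for (_prefix, size) in mine & theirs)
```

Each person would run something like profile_tree on their own directories, exchange only the resulting sets, and then call estimated_shared_bytes(mine, theirs) to get an approximate number of bytes that would be stored once instead of twice.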
