It seems a bit strange to me to use only 8 bits to track unique files of the same size.
On a system with a million files, that would saturate once the files were spread over only about 3906 distinct sizes (3906 sizes x 256 possible 8-bit values is roughly a million). Granted, the variety of sizes will likely be much greater than that, but it seems like it could skew the results quite a bit. I'd suggest something like 32 bits: enough to minimize collisions, yet not enough to prove beyond reasonable doubt that a match isn't just a collision. Thoughts?

zooko wrote:
> Folks:
>
> Alen Peacock wrote to me to point out that adler32 has poor behavior
> for small files, and that small files are important. Separately, I
> realized that my "8 bits of adler32 of first 8192 bytes" wasn't that
> great of a design to minimize the privacy risks.
>
> Also, Brian Warner wondered aloud if dupfilefind could take a list of
> arguments which are the names of directories to examine.
>
> Therefore, I've uploaded a new version of dupfilefind -- v1.2.0 --
> which emits the first 8 bits of the md5sum of the whole file, and
> which takes an optional list of directories to inspect.
>
> The output from dupfilefind v1.2.0's "--profiles" option is not
> comparable to the output from dupfilefind v1.1's "--profiles" option,
> so please send me the new version.
>
> If you have easy_install, you can upgrade with "easy_install -U
> dupfilefind".
>
> I'll compare all the compressed files that people send in and post
> the results. If I don't reply to your e-mail please re-send, as I
> may have failed to notice it in the tide of spam.
>
> Thanks!
>
> Regards,
>
> Zooko
>
> _______________________________________________
> p2p-hackers mailing list
> [email protected]
> http://lists.zooko.com/mailman/listinfo/p2p-hackers
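To make the profiling scheme under discussion concrete, here's a minimal sketch (not dupfilefind's actual code; the function and variable names are my own) of the v1.2.0-style profile: each file is keyed by its size plus the first 8 bits of the MD5 digest of its whole contents, and files sharing a key are candidate duplicates:

```python
import hashlib
import os
from collections import defaultdict

def profile_byte(path):
    """First 8 bits (one byte) of the MD5 digest of the whole file."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # Hash in chunks so large files don't have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.digest()[0]

def profiles(directories):
    """Group files by (size, first MD5 byte).

    Files that share a key are candidate duplicates; with only 8 bits
    of digest, distinct files of the same size collide 1 time in 256.
    """
    groups = defaultdict(list)
    for d in directories:
        for root, _, names in os.walk(d):
            for name in names:
                path = os.path.join(root, name)
                key = (os.path.getsize(path), profile_byte(path))
                groups[key].append(path)
    return groups
```

Widening `profile_byte` to return `h.digest()[:4]` would give the 32-bit variant suggested above, cutting the same-size collision rate from 1/256 to about 1 in 4 billion.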
