It seems a bit strange to me to have only 8 bits to distinguish unique files of 
the same size.

On a system with a million files, that space saturates once there are only about 
3906 distinct file sizes (a million files spread over 3906 sizes is roughly 256 
files per size, one per possible 8-bit value).  Granted, the variety of sizes 
will likely be much greater than that, but it still seems like it could skew 
the results quite a bit.
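For illustration, here's a back-of-the-envelope sketch in Python (my own, not 
part of dupfilefind) using the standard birthday approximation for the expected 
number of colliding pairs among k same-sized files with a b-bit fingerprint:

    def expected_colliding_pairs(k, bits):
        # birthday approximation: C(k, 2) / 2**bits
        return k * (k - 1) / (2.0 * 2 ** bits)

    for bits in (8, 32):
        # e.g. 256 files that happen to share one file size
        print(bits, expected_colliding_pairs(256, bits))

With 256 files of one size, an 8-bit fingerprint already expects on the order 
of 127 colliding pairs, while a 32-bit one expects around 10^-5.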

I'd suggest something like 32 bits: enough to make collisions rare, yet not 
enough to prove beyond reasonable doubt that a match isn't just a collision.
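Concretely, since the new version already takes the first 8 bits of the md5sum 
of the whole file, keeping the first 32 bits would be a small change along 
these lines (a hypothetical sketch; I haven't looked at the dupfilefind source, 
and the function name is mine):

    import hashlib

    def fingerprint32(path):
        # Hash the whole file and keep only the first 32 bits (4 bytes)
        # of the md5 digest, mirroring the current "first 8 bits" scheme.
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                h.update(chunk)
        return h.digest()[:4]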

Thoughts?

zooko wrote:
> Folks:
> 
> Alen Peacock wrote to me to point out that adler32 has poor behavior  
> for small files, and that small files are important.  Separately, I  
> realized that my "8 bits of adler32 of first 8192 bytes" wasn't that  
> great of a design to minimize the privacy risks.
> 
> Also, Brian Warner wondered aloud if dupfilefind could take a list of  
> arguments which are the names of directories to examine.
> 
> Therefore, I've uploaded a new version of dupfilefind -- v1.2.0 --  
> which emits the first 8 bits of the md5sum of the whole file, and  
> which takes an optional list of directories to inspect.
> 
> The output from dupfilefind v1.2.0's "--profiles" option is not  
> comparable to the output from dupfilefind v1.1's "--profiles" option,  
> so please send me the new version.
> 
> If you have easy_install, you can upgrade with "easy_install -U  
> dupfilefind".
> 
> I'll compare all the compressed files that people send in and post  
> the results.  If I don't reply to your e-mail please re-send, as I  
> may have failed to notice it in the tide of spam.
> 
> Thanks!
> 
> Regards,
> 
> Zooko
