Hi everybody,
I'm pleased to announce that I have published both a paper on our
fuzzy hashing technique and an implementation of it. You may have
heard me talk about this on the Cyberspeak podcast[1], and now it's
out!
The program, ssdeep, works like md5deep to create a short text
signature for each input file. The signatures can be used to match
other files against the original. Unlike MD5 or SHA-1, however, this
algorithm can match two input files even if they are not exactly the
same. Files match if they have significant homologies, that is, the
same sequences of bytes in the same order. For example, if file2 is
the same as file1 but with an extra 'A' appended to the end, they
match. If file2 is just the first 33% of file1, they match. If file2
is just the last 33% of file1, they match. If file1 and file2 differ
by lots of little changes, however, they won't match. Fuzzy hashing
is not perfect.
But it is pretty cool!
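To make the contrast with MD5 concrete, here is a minimal Python
sketch. It is not the ssdeep algorithm: it uses difflib from the
Python standard library as a stand-in similarity measure and compares
the raw bytes directly rather than short signatures. The helper
shared_fraction and the generated test data are hypothetical names
chosen for illustration only.

```python
import hashlib
import random
from difflib import SequenceMatcher


def shared_fraction(a: bytes, b: bytes) -> float:
    """Fraction of the shorter input that difflib matches, in order, in the other."""
    # autojunk=False so no byte value is skipped as "popular" junk.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


# Hypothetical stand-in for file1: 1024 reproducible pseudo-random bytes.
rng = random.Random(0)
file1 = bytes(rng.getrandbits(8) for _ in range(1024))

# The three example cases from the text above.
variants = {
    "file1 plus an extra 'A'": file1 + b"A",
    "first third of file1": file1[: len(file1) // 3],
    "last third of file1": file1[-(len(file1) // 3):],
}

for name, file2 in variants.items():
    # An exact hash only tells us whether the two inputs are identical.
    same_md5 = hashlib.md5(file1).digest() == hashlib.md5(file2).digest()
    score = shared_fraction(file1, file2)
    print(f"{name}: same MD5? {same_md5}, shared fraction {score:.2f}")
```

In each case the MD5 digests differ, while the stand-in measure still
sees the shared run of bytes. ssdeep gets a comparable effect from
short signatures, so it never needs both original files side by side.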
You'll find the program at http://ssdeep.sourceforge.net/ and the
full academic paper at
http://dfrws.org/2006/proceedings/12-Kornblum.pdf.
Let me know if you have any questions!
[1] The Cyberspeak podcast can be found at
http://cyberspeak.libsyn.com/index.php?post_id=115142
--
Jesse