On Sep 3, 2009, at 8:06 AM, Nicolas barnier wrote:
An amazing and simple technique for detecting plagiarism is the compression-based similarity distance. It is a side effect of state-of-the-art compression algorithms that can be used to compute a distance between many kinds of documents (it seems to work at least for program sources, books, music, DNA, etc.): take any two files A and B; compress A, compress B, and compress the concatenation of A and B, i.e. AB; take the sizes c(A), c(B) and c(AB) of these compressed files; the similarity distance is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B)). Indeed, if documents A and B share information, c(AB) will be much smaller than c(A) + c(B), which pushes d(A,B) toward 0.
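For what it's worth, the formula above can be tried out in a few lines. This is only a minimal sketch: the message does not say which compressor to use, so zlib is an assumption here, and the names compressed_size and similarity_distance are made up for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Return c(X): the size in bytes of the zlib-compressed input.

    zlib is an assumed stand-in for "a state-of-the-art compressor";
    any other compressor could be substituted here.
    """
    return len(zlib.compress(data, 9))


def similarity_distance(a: bytes, b: bytes) -> float:
    """d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))."""
    ca = compressed_size(a)
    cb = compressed_size(b)
    cab = compressed_size(a + b)  # c(AB): compress the concatenation
    return 1 - (ca + cb - cab) / max(ca, cb)


def file_distance(path_a: str, path_b: str) -> float:
    """Convenience wrapper: read two files as raw bytes and compare them."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return similarity_distance(fa.read(), fb.read())
```

Calling file_distance on two submissions gives a value near 0 when they share most of their content and a value near 1 when they share almost nothing. One caveat: zlib only looks back 32 KB, so for larger files a compressor with a bigger window would probably be needed for c(AB) to reflect the shared content.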
Also see Alex Aiken's "MOSS" (Measure Of Software Similarity). It's an online service; it's language-specific, but works for a variety of languages. I don't know how its algorithm compares to the one here. I suspect it's different insofar as the one you describe is language-independent.
John Clements