On Sep 3, 2009, at 8:06 AM, Nicolas Barnier wrote:

An amazing and simple technique to detect plagiarism is
compression-based similarity distance. It is a side-effect
of state-of-the-art compression algorithms that can be used
to compute a distance for many kinds of documents (it seems
to work at least for program sources, books, music, DNA, etc.):
take any two files A and B, compress A, compress B, and compress
the concatenation of A and B, i.e. AB; take the size of these
compressed files c(A), c(B) and c(AB); the similarity distance
is simply d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
Indeed, if documents A and B share information, c(AB) will be
much smaller than c(A) + c(B), so d(A,B) gets close to 0; for
unrelated documents, c(AB) stays close to c(A) + c(B) and
d(A,B) gets close to 1.
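
The formula above can be tried out with a few lines of code. What follows is only a minimal sketch, in Python for brevity; the choice of zlib as the compressor and the names csize and similarity_distance are assumptions made for illustration, not something the original message specifies. Note also that zlib only finds repeats within a 32 KB window, so a stronger compressor would be needed for large files.

```python
import zlib


def csize(data: bytes) -> int:
    """Size in bytes of `data` after compression (zlib, maximum level)."""
    return len(zlib.compress(data, 9))


def similarity_distance(a: bytes, b: bytes) -> float:
    """d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))."""
    ca = csize(a)
    cb = csize(b)
    cab = csize(a + b)  # compress the concatenation AB
    return 1 - (ca + cb - cab) / max(ca, cb)
```

Calling similarity_distance on the raw bytes of two source files should give a value near 0 when one is a lightly edited copy of the other and a value near 1 when they are unrelated; the exact numbers depend on the compressor and on file sizes.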

Also see Alex Aiken's "MOSS" (measure of software similarity). It's online and language-specific, and it works for a variety of languages. I don't know how its algorithm compares to the one here. I suspect it's different insofar as the one you describe is language-independent.

John Clements

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs
