A remarkably simple technique for detecting plagiarism is the compression-based similarity distance. It is a side effect of state-of-the-art compression algorithms and can be used to compute a distance between many kinds of documents (it seems to work at least for program sources, books, music, DNA, etc.). Take any two files A and B; compress A, compress B, and compress the concatenation of A and B, i.e. AB; then take the sizes c(A), c(B) and c(AB) of these compressed files. The similarity distance is simply

  d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))

Indeed, if documents A and B share information, c(AB) will be much smaller than c(A) + c(B), and the distance will be close to 0. For unrelated documents, c(AB) is roughly c(A) + c(B), and the distance is close to 1.
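The formula above can be sketched in a few lines. The following Python is an illustration only, not a reference implementation: the function names compressed_size and distance are made up for this example, and Python's standard bz2 and gzip modules stand in for the compressors.

```python
import bz2
import gzip


def compressed_size(data: bytes, algo: str = "bz2") -> int:
    """Return c(X): the size in bytes of `data` once compressed.

    `algo` is a hypothetical parameter for this sketch ("bz2" or "gzip").
    """
    if algo == "bz2":
        return len(bz2.compress(data))
    if algo == "gzip":
        return len(gzip.compress(data))
    raise ValueError("unknown compression algorithm: " + algo)


def distance(a: bytes, b: bytes, algo: str = "bz2") -> float:
    """Compression-based similarity distance between two documents:

        d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))
    """
    ca = compressed_size(a, algo)
    cb = compressed_size(b, algo)
    cab = compressed_size(a + b, algo)  # AB is the plain concatenation
    return 1.0 - (ca + cb - cab) / max(ca, cb)
```

To compare two files, read both in binary mode and pass their contents to distance. Real compressors add headers and work on bounded blocks or windows, so the value is only approximate: it can fall slightly outside [0, 1], and very short files give unreliable results.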
A good article (unfortunately in French) can be found at:

  http://interstices.info/jcms/c_21828/classer-musiques-langues-images-textes-et-genomes

It links to "Baldr", a free Java application written by a French CS professor, which computes this distance pairwise over a set of source files and sorts the results. What appears is that the distance between two documents is much smaller in a case of plagiarism than for any other pair (even though good students tend to produce similar source code for the same exam).

I wrote a small OCaml program (baldml) to perform the same task (but without a GUI):

  ./baldml.opt -algo bz2 -regexp ".+ml$" -n 3 dir

Here you can choose the compression algorithm (bz2 or gzip), specify a Str-style regexp to filter the files (OCaml files in the example, but I also use it for C exams), and set the number of sorted lines you want in the output, out of the n(n-1)/2 unique pairs of the n matching files found recursively in the directory "dir". I can provide the 100 lines of code if anyone is interested.

Hope this helps,

--
Nicolas

_______________________________________________
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs