Re: [Caml-list] Ocaml clone detector

2009-09-03 Thread John Clements


On Sep 3, 2009, at 8:06 AM, Nicolas Barnier wrote:


An amazing and simple technique for detecting plagiarism is
compression-based similarity distance. It is a side effect
of state-of-the-art compression algorithms that they can be
used to compute a distance between many kinds of documents (it
seems to work at least for program sources, books, music, DNA,
etc.): take any two files A and B, compress A, compress B, and
compress the concatenation of A and B, i.e. AB; take the sizes
of these compressed files, c(A), c(B) and c(AB); the similarity
distance is simply
d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max (c(A), c(B)).
Indeed, if documents A and B share information, c(AB) will be
much smaller than c(A) + c(B), which pulls d(A,B) toward 0.
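
The recipe above can be sketched in a few lines. This is only an
illustration: the message does not name a compressor, so zlib is an
assumption here, and the function names compressed_size and
similarity_distance are made up for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """c(X): size in bytes of X after zlib compression.

    zlib at level 9 is an assumption; the original message only says
    "state-of-the-art compression algorithms".
    """
    return len(zlib.compress(data, 9))


def similarity_distance(a: bytes, b: bytes) -> float:
    """d(A,B) = 1 - (c(A) + c(B) - c(AB)) / max(c(A), c(B))."""
    ca = compressed_size(a)
    cb = compressed_size(b)
    cab = compressed_size(a + b)  # AB is the plain concatenation of A and B
    # zlib output is never empty, so max(ca, cb) is always positive.
    return 1.0 - (ca + cb - cab) / max(ca, cb)
```

To compare two source files, read them in binary mode and pass the
bytes to similarity_distance. Note that zlib only looks back about
32 KB, so for larger files a compressor with a bigger window would
probably be needed for c(AB) to notice the shared content.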


Also see Alex Aiken's MOSS (Measure Of Software Similarity).  It's
online and language-specific, and works for a variety of languages.
I don't know how its algorithm compares to the one here. I suspect
it's different insofar as the one you describe is language-independent.


John Clements

___
Caml-list mailing list. Subscription management:
http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
Archives: http://caml.inria.fr
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
Bug reports: http://caml.inria.fr/bin/caml-bugs


Re: [Caml-list] Ocaml clone detector

2009-09-02 Thread Erik de Castro Lopo
Kihong Heo wrote:

 I want to know whether there is a clone detector for OCaml programs.

Maybe it would help if you explained what a clone detector
is.

Erik
-- 
Erik de Castro Lopo
http://www.mega-nerd.com/



Re: [Caml-list] Ocaml clone detector

2009-09-02 Thread Jacques Carette
Erik de Castro Lopo wrote:
 Maybe it would help if you explained what a clone detector
 is.
   
A clone is software-engineering speak for duplicated code.  Exactly
what qualifies as 'duplicated code' and how to efficiently find such
(without too many false positives nor false negatives) is still fairly
active research.  This is a huge issue in languages without decent
abstraction features, and less so otherwise (or so it seems).

Jacques



Re: [Caml-list] Ocaml clone detector

2009-09-02 Thread Erik de Castro Lopo
Jacques Carette wrote:

 Erik de Castro Lopo wrote:
  Maybe it would help if you explained what a clone detector
  is.

 A clone is software-engineering speak for duplicated code.  Exactly
 what qualifies as 'duplicated code' and how to efficiently find such
 (without too many false positives nor false negatives) is still fairly
 active research.  This is a huge issue in languages without decent
 abstraction features, and less so otherwise (or so it seems).

Thanks for the explanation. 

I can think of two situations where such a clone detector may be useful:
finding similar chunks of code so that they can be refactored (software
QA management), and detecting plagiarism in student programming
assignments.

I suspect that these two kinds of clone detectors would actually be
quite different.

Cheers,
Erik
-- 
Erik de Castro Lopo
http://www.mega-nerd.com/



Re: [Caml-list] Ocaml clone detector

2009-09-02 Thread Andrej Bauer
As far as student plagiarism goes, we found out that for Java
programs, you can pretty much detect fraud by erasing everything from
the programs except parentheses and braces ( ) { } and then comparing
the remaining strings by edit distance. My explanation is that
students who copy code don't want to spend much time on it. In order
to change the parentheses they would have to understand the structure
of the program, which they don't.
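
A rough sketch of that idea, for illustration only. The message does not
say which edit distance was used, so plain Levenshtein distance is an
assumption, and the names bracket_skeleton, edit_distance and
skeleton_distance are made up for the example.

```python
def bracket_skeleton(source: str) -> str:
    """Erase everything except ( ) { }, keeping the original order."""
    return "".join(ch for ch in source if ch in "(){}")


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: insert, delete and substitute each cost 1.

    Assumption: the message says only "editing distance".
    """
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def skeleton_distance(src_a: str, src_b: str) -> int:
    """Edit distance between the bracket skeletons of two programs."""
    return edit_distance(bracket_skeleton(src_a), bracket_skeleton(src_b))
```

Renaming variables or reordering comments leaves the skeleton untouched,
so a copied program still gets distance 0. For programs of different
sizes one would probably want to divide by the longer skeleton's length
before comparing scores across pairs.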

Andrej
