Hello Amila,

Thanks for your questions.

1. As for imports and annotations, their order may be strong evidence of cut&pasted code. In general, I believe the algorithm should be language independent, and important structural elements such as method bodies should be defined in the form of a grammar or plug-ins.
2. If you manage to find a whole cut&pasted method, and this method is more or less unique according to the search results, this proves it was cut&pasted. I agree that it may be worth downloading the whole project the method was stolen from and doing a local in-depth analysis. While designing such an extensible tool may be useful, I believe it is out of scope for the initial GSoC task, because existing open source solutions are available. The first run of our tool may create a list of URLs for further in-depth analysis and a configuration script for CPD [1] to perform it.

BTW, I have updated the task [2] with the required skills and added a prefix to the mail subject.

[1] http://pmd.sourceforge.net/cpd.html
[2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste

On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <[email protected]> wrote:
> Hi all,
> I'm Amila. I'm interested in the Apache RAT cut&paste detector. After
> a short discussion with Alexei, I felt it would be worthwhile to share
> my ideas about the project on this mailing list.
> After going through the mails, I found that you have considered
> saving search engine resources while detecting plagiarised code.
>
> This is what I thought of:
> 1. Say that we are going to check a whole class against existing
> code. As import statements and annotations are common to most
> classes, we should avoid them. So first we remove the import
> statements, annotations (and similar common features) to get clean
> code. This is what we will use for searching afterwards.
>
> 2. Rather than directly using a sliding window mechanism, I suggest this:
> First the code can be broken into components. For Java code these
> components are variables, methods, comments, ...
> Then we perform the search using these components.
> Suppose that there are five methods in a particular class (so there are
> five method components). We take one of those methods and
> perform a search.
> If it returns results, we compare the component we searched for with
> each result and calculate the similarity between them. I think the
> Levenshtein algorithm (which is used to calculate word distances) can
> be used here to determine the similarity between two pieces of code
> (this value is one heuristic value).
> In a similar manner we perform the search using the other components.
> If there are results, they are compared as above and the heuristic
> value is added to the first.
>
> If any component fails to return a result when searched, we use the
> sliding window mechanism on that component to determine its heuristic
> value.
> Start with the first word in the component (here the component is a
> single method) and perform the search, then add the second word to it
> and search again. We do this until we find the largest match. After
> that, similarities are calculated.
>
> After performing these component searches, we'll have a value for a
> particular class file.
> This value can be used to determine whether the code has been cut&pasted.
>
> I'd like to hear your comments on this and start a bit of discussion.
>
> Thanks in advance!
> Best Regards,
> Amila
>

--
With best regards,
Alexei Fedotov,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/
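
As a rough illustration of the component-based heuristic proposed in the quoted message, here is a minimal Python sketch. It is not part of RAT or CPD. The names `strip_common_lines`, `similarity`, `class_score` and the `search` callable are hypothetical, and the rule that "common features" are lines starting with `import` or `@` is an assumption made only for this example. The sliding window fallback for components without search results is not covered.

```python
import re

# Assumption: lines starting with "import" or "@" are the "common features"
# that the quoted proposal wants removed before searching.
_COMMON_LINE = re.compile(r"^\s*(import\s|@)")


def strip_common_lines(source: str) -> str:
    """Drop import statements and annotation lines from Java-like source."""
    kept = [line for line in source.splitlines() if not _COMMON_LINE.match(line)]
    return "\n".join(kept)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map edit distance to a score between 0 and 1 (1.0 means identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def class_score(components, search):
    """Sum the best similarity of each component over its search results.

    `search` is a hypothetical callable that takes a component's text and
    returns a list of candidate texts found by a search engine.
    Components with no results contribute nothing here.
    """
    total = 0.0
    for component in components:
        candidates = search(component)
        if candidates:
            total += max(similarity(component, c) for c in candidates)
    return total
```

With a real search back end, `class_score` would be called on the components extracted from the cleaned source of one class. How the resulting sum should be thresholded to flag a class as cut&pasted is left open, as it is in the proposal.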
