Hello Alexei, Thanks for the detailed reply! I have a little problem to get clarified about the new eligibility. After introducing the new prototype, what RAT should be able to do? Should it be able to find duplicates in a target code( which now CPD is doing) or should it be able to find a given code snippet in the target code?
2009/3/23 Alexei Fedotov <[email protected]> > Hello Amila, > Thanks for your questions. > > 1. As for imports and annotations, the order of them may be a serious proof > for cut & pasted code. Generally I believe that the algorithm should be > language independent, and important structural things like method bodies > should be defined in a form of a grammar or plug-ins. > > 2. If you succeed to find the whole cut&pasted method, and this method is > more or less unique according to the search results, this proves it was > cut&pasted. I agree that it may worth to download the whole project the > method was stolen from and do local in-depth analysis. While it may be > useful and worth designing an extensible tool, I believe this is out of the > scope of the initial GSoC task due to availability of exisitng open source > solutions. The first run of our tool may create a list of URLs for further > in-depth analysis and configure script for CPD [1] to perform it. > > BTW, I have updated the task [2] on required skills and added prefix to the > mail subject. > > [1] http://pmd.sourceforge.net/cpd.html > [2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste > > > On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <[email protected]> wrote: > > > Hi all, > > I'm Amila. I'm interested in apache RAT cut&paste detector. After > > having a little discussion with Alexei I felt that > > It's worthy to express my ideas about the project through this mailing > > list. > > After going through the mails I found that you have considered about > > saving search engine resources, whilst detecting a plagarised code. > > > > This is what I thought of; > > 1. Say that we are going to check a whole class against the existing > > codes.As the import statement,annotations are common to most of the > > class > > we should avoid them. So first we are going to remove the import > > statements , annotations (and such common features) and have a clean > > code.This is the one > > we will use for searching afterwards. > > > > 2.Than directly using a sliding window mechanism I sugest this: > > First the code can be broken into components. For a java code these > > components are varaibles, methods,comments,... > > Then we are performing the search using these components. > > Suppose that there are five methods in a particular class(so there are > > five method components). We are taking one method out of those then > > perform a search. > > If it has some result,we compare the component we searched with the > > result and calculate number of similarities between them. I think that > > Levenshtein algorithm( > > which is used to calculate word distances) can be used here to > > determine the similarity between two codes (this value is a one > > heuristic value). > > In a similar manner we perform the search using other components also. > > If there are results then it is compared as above and heuristic value > > is > > added to the first. > > > > When performing a search if any component fails to return a result > > ,then we use sliding window mechanism on that component to determine > > its heuristic value. > > First start by getting the first word in the component(here the > > component is a single method), perform the search ,then add the second > > word to it, perform search > > We do this until we find the largest match. After that similarities > > are calculated. > > > > After performing these component search we'll have some value for a > > particular class file. > > This value can be used to determine if the code has been cut&paste. > > > > I'd like to hear your comments on this and start a bit discussion. > > > > Thanks In Advance! > > Best Regards, > > Amila > > > > > > -- > С уважением, > Алексей Федотов, > http://www.telecom-express.ru/ > http://people.apache.org/~aaf/ <http://people.apache.org/%7Eaaf/> >
