Hello Amila,
Thanks for your questions.

1. As for imports and annotations, the order of them may be a serious proof
for cut & pasted code. Generally I believe that the algorithm should be
language independent, and important structural things like method bodies
should be defined in a form of a grammar or plug-ins.

2. If you succeed to find the whole cut&pasted method, and this method is
more or less unique according to the search results, this proves it was
cut&pasted. I agree that it may worth to download the whole project the
method was stolen from and do local in-depth analysis. While it may be
useful and worth designing an extensible tool, I believe this is out of the
scope of the initial GSoC task due to availability of exisitng open source
solutions. The first run of our tool may create a list of URLs for further
in-depth analysis and configure script for CPD [1] to perform it.

BTW, I have updated the task [2] on required skills and added prefix to the
mail subject.

[1] http://pmd.sourceforge.net/cpd.html
[2] http://wiki.apache.org/general/SummerOfCode2009#rat-1-cutnpaste


On Sat, Mar 21, 2009 at 8:09 AM, Amila De Silva <[email protected]> wrote:

> Hi all,
>        I'm Amila. I'm interested in apache RAT cut&paste detector. After
> having a little discussion with Alexei I felt that
> It's worthy to express my ideas about the project through this mailing
> list.
> After going through the mails I found that you have considered about
> saving search engine resources, whilst detecting a plagarised code.
>
> This is what I thought of;
> 1. Say that we are going to check a whole class against the existing
> codes.As the import statement,annotations are common to most of the
> class
> we should avoid them. So first we are going to remove the import
> statements , annotations (and such common features) and have a clean
> code.This is the one
> we will use for searching afterwards.
>
> 2.Than directly using a sliding window mechanism I sugest this:
> First the code can be broken into components. For a java code these
> components are varaibles, methods,comments,...
> Then we are performing the search using these components.
> Suppose that there are five methods in a particular class(so there are
> five method components). We are taking one method out of those then
> perform a search.
> If it has some result,we compare the component we searched with the
> result and calculate number of similarities between them. I think that
> Levenshtein algorithm(
> which is used to calculate word distances) can be used here to
> determine the similarity between two codes (this value is a one
> heuristic value).
> In a similar manner we perform the search using other components also.
> If there are results then it is compared as above and heuristic value
> is
> added to the first.
>
> When performing a search if any component fails to return a result
> ,then we use sliding window mechanism on that component to determine
> its heuristic value.
> First start by getting the first word in the component(here the
> component is a single method), perform the search ,then add the second
> word to it, perform search
> We do this until we find the largest match. After that similarities
> are calculated.
>
> After performing these component search we'll have some value for a
> particular class file.
> This value can be used to determine if the code has been cut&paste.
>
> I'd like to hear your comments on this and start a bit discussion.
>
> Thanks In Advance!
> Best Regards,
> Amila
>



-- 
С уважением,
Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/

Reply via email to