Hi all,
I'm Amila, and I'm interested in the Apache RAT cut&paste detector.
After a short discussion with Alexei, I felt it would be worthwhile
to share my ideas about the project on this mailing list.
After going through the mails, I found that you have been considering
how to save search engine resources while detecting plagiarised code.
This is what I have in mind:
1. Say we are going to check a whole class against existing code.
Since import statements and annotations are common to most classes,
we should ignore them. So we first remove the import statements,
annotations (and similar common features) to get a clean version of
the code. This is the version we will use for searching afterwards.
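
The cleaning step above could look roughly like the sketch below. It
is only an illustration: the class name SourceCleaner and both regular
expressions are my own assumptions, and it only handles imports and
annotations that sit on a line of their own. A real implementation
would probably want a proper Java parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of RAT: drops lines that are common to
// most Java classes before the code is used as a search query.
final class SourceCleaner {
    // Assumption: an import statement occupies a whole line.
    private static final Pattern IMPORT_LINE =
            Pattern.compile("^\\s*import\\s+[\\w.*\\s]+;\\s*$");
    // Assumption: an annotation sits on its own line,
    // e.g. @Override or @SuppressWarnings("unchecked").
    private static final Pattern ANNOTATION_LINE =
            Pattern.compile("^\\s*@\\w+(\\(.*\\))?\\s*$");

    static String clean(String source) {
        List<String> kept = new ArrayList<>();
        for (String line : source.split("\\R")) {
            if (IMPORT_LINE.matcher(line).matches()) continue;
            if (ANNOTATION_LINE.matcher(line).matches()) continue;
            kept.add(line);
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String src = "import java.util.List;\n@Deprecated\nclass A {\n}";
        System.out.println(clean(src));
    }
}
```

Annotations written inline with a declaration (for example
"@Override public String toString()") are left untouched by this
sketch.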
2. Rather than using a sliding window mechanism directly, I suggest
this:
First, the code is broken into components. For Java code these
components are variables, methods, comments, ...
Then we perform the search using these components.
Suppose there are five methods in a particular class (so there are
five method components). We take one of those methods and perform a
search with it.
If the search returns a result, we compare the component we searched
for with the result and calculate how similar they are. I think the
Levenshtein algorithm (which calculates the edit distance between two
strings) can be used here to determine the similarity between the two
pieces of code (this value is one heuristic value).
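
To make the comparison step concrete, here is a minimal sketch of the
Levenshtein distance in Java, plus one possible way to turn it into a
similarity value. The normalisation to the range 0..1 (dividing by the
length of the longer string) is my own assumption; the names
Levenshtein, distance and similarity are only for illustration.

```java
// Hypothetical helper: character-level Levenshtein distance between a
// searched component and a search result.
final class Levenshtein {
    // Classic dynamic-programming edit distance, keeping only two rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: similarity = 1 - distance / length of the longer string,
    // so 1.0 means identical and 0.0 means nothing in common.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0;
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));
        System.out.println(similarity("int a = b;", "int a = c;"));
    }
}
```

The distance here is computed on characters. Computing it on tokens
instead would make the value less sensitive to renamed variables, but
that is a design choice I have not settled on.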
We perform the search with the other components in the same way.
If there are results, they are compared as above and the resulting
heuristic value is added to the first one.
If a component fails to return any result when searched, we use the
sliding window mechanism on that component to determine its heuristic
value.
We start with the first word of the component (here the component is a
single method) and perform the search, then add the second word to it
and perform the search again.
We do this until we find the largest match. After that the
similarities are calculated.
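
The word-by-word fallback described above can be sketched as follows.
The hasMatch predicate is a hypothetical stand-in for the real search
engine call (it just answers "did this query return anything?"), and
the class and method names are made up for the example. The window
only grows from the first word, as in the description.

```java
import java.util.function.Predicate;

// Hypothetical helper: grows a query one word at a time and remembers
// the longest query that still returned a search result.
final class GrowingQuery {
    static String longestMatchingPrefix(String component,
                                        Predicate<String> hasMatch) {
        String trimmed = component.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        String best = "";
        StringBuilder query = new StringBuilder();
        for (String word : trimmed.split("\\s+")) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append(word);
            if (!hasMatch.test(query.toString())) {
                break; // adding this word lost the match, so stop here
            }
            best = query.toString();
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumption: a plain substring test plays the role of the search engine.
        String known = "int total = a + b; return total;";
        System.out.println(
                longestMatchingPrefix("int total = a + c;", known::contains));
    }
}
```

Note that this issues one query per word, so for long methods it may
be worth growing the query by several words at a time to save search
engine resources.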
After performing these component searches we will have a single value
for a particular class file.
This value can be used to determine whether the code has been cut and
pasted.
I'd like to hear your comments on this and start a discussion.
Thanks in advance!
Best Regards,
Amila