Hi.
I am working on my project, apache-rat-pd.
The Apache RAT plagiarism detector is a command-line tool that searches
a code base for possibly plagiarized code using web code search engines.
The project is part of Google Summer of Code 2009 and is mentored
by Apache.

The idea is to query code search engines (such as Google Code Search [1],
Koders [2] or Krugle [3]) to check whether the code we send in the query
has been copied from somewhere.
More information about the project can be found at
http://code.google.com/p/apache-rat-pd/

Our initial plan was to make it work with Google Code Search first,
because it is open to developers, has custom libraries, and has
great support for searching by regular expressions.
So far, I have created an initial version of the tool. It sends the part
of the code we suspect to be plagiarized as a query. We have run into
some problems and need your help to resolve them.

Sometimes, when we send a large number of queries to Google Code Search
in a short period of time, the engine starts rejecting our queries.
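
To make the situation concrete, the kind of client-side throttling we
could apply is sketched below: retry a rejected query after a delay that
doubles each time. This is only a sketch under assumptions. The class
name, the QueryRejectedException type and the delay values are
hypothetical, and we do not know whether this matches any rule that
Google Code Search actually enforces.

```java
import java.util.concurrent.Callable;

final class BackoffSketch {
    /** Hypothetical marker meaning "the search engine rejected the query". */
    static class QueryRejectedException extends Exception {}

    /**
     * Runs the query and, if it is rejected, waits and retries.
     * The wait doubles after every rejection (exponential backoff).
     * maxAttempts and initialDelayMs are assumed values, not documented limits.
     */
    static <T> T withBackoff(Callable<T> query, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return query.call();
            } catch (QueryRejectedException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last allowed attempt
                }
                Thread.sleep(delayMs);
                delayMs *= 2;
            }
        }
    }
}
```

A call would wrap whatever code sends one query, for example
withBackoff(() -> sendQuery(regex), 5, 1000L), where sendQuery is a
placeholder for our own method.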

So we have some questions:

 1. Does Google Code Search have some sort of DDoS attack [4] protection
 mechanism that we are triggering?
 If so, how can we avoid this behaviour? What rules must we follow?

 2. Our aim is to locate plagiarised code, so we sometimes send very
 long queries to Google Code Search when the code fragment is large.
 Is there a limit on query length?

 3. The regular expression generator in apache-rat-pd may contain
 mistakes, and we sometimes get false positives for code that is not
 plagiarized. We ask Google Code Search to find a code fragment using
 our regular expression query, and sometimes the engine returns results
 that only partially match our code.
 Do you have any advice on how to get only exact matches and avoid
 false positives?

We use the gdata-codesearch-2.0 library to communicate with the Google
Code Search engine.

Example of a query for a simple HelloWorld.java:

Code:
public class HelloWorld {
    public static void main(String argv[]) {
        System.out.println("Hello World.");
    }
}

Query:
http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
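
For readers who want to reproduce a query like this, here is a minimal
sketch of the general technique: split the code into tokens, escape
regular expression metacharacters, and join the tokens with an
optional-whitespace pattern. It is not the actual apache-rat-pd
generator. The class and method names are hypothetical, the tokenizer is
a simplifying assumption (it also splits inside string literals), and it
joins tokens with \s*, which matches the same strings as the (\s?)+
used in the query above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QueryRegexSketch {
    // Assumed tokenizer: a run of word characters, or one other non-space character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");
    // Characters that have a special meaning in a regular expression.
    private static final String META = "\\.[]{}()*+?^$|";

    /** Splits source code into identifier/number tokens and single symbols. */
    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    /** Prefixes a single metacharacter token with a backslash. */
    static String escape(String token) {
        if (token.length() == 1 && META.indexOf(token.charAt(0)) >= 0) {
            return "\\" + token;
        }
        return token;
    }

    /** Builds a regex that tolerates any amount of whitespace between tokens. */
    static String toRegex(String source) {
        StringBuilder regex = new StringBuilder();
        for (String token : tokenize(source)) {
            if (regex.length() > 0) {
                regex.append("\\s*");
            }
            regex.append(escape(token));
        }
        return regex.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("System.out.println(\"Hello World.\");"));
    }
}
```

Because whitespace between tokens is fully optional in this sketch, the
resulting pattern also matches text in which neighbouring words are run
together, which is one possible source of partial matches.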


[1] http://www.google.com/codesearch
[2] http://www.koders.com/
[3] http://www.krugle.com/
[4] http://en.wikipedia.org/wiki/Denial-of-service_attack

Best regards,
Marija
