Hi,

I am working on a project called apache-rat-pd. The Apache RAT plagiarism detector is a command-line tool that searches a code base for possibly plagiarized code using web code search engines. The project is part of Google Summer of Code 2009 and is mentored by Apache.

The idea is to query code search engines (such as Google Code Search [1], Koders [2], or Krugle [3]) to check whether the code we send in the query was copied from somewhere. More information about the project can be found at http://code.google.com/p/apache-rat-pd/

Our initial plan was to make it work with Google Code Search first, because it is open to developers, has custom libraries, and has great support for searching by regular expressions. So far, I have created an initial version of the tool. It sends a query built from a piece of code that we suspect to be plagiarized.

We have run into some problems and need your help to resolve them. Sometimes, when we send Google Code Search a large number of queries in a short period of time, the engine starts rejecting our queries. So we have some questions:

1. Does Google Code Search have some sort of DDoS attack [4] protection mechanism that we are triggering? If so, how can we avoid this behaviour? What rules must we follow?

2. Our aim is to locate plagiarized code, so we sometimes send Google Code Search very long queries when the piece of code is large. Is there a limit on query length?

3. The regular expression generator in apache-rat-pd may contain mistakes, and we sometimes get false positives when we query code that is not plagiarized. We ask Google Code Search to find a piece of code using our regular expression query, and sometimes the engine returns results that only partially match our code. Do you have any advice on how to get only exact matches, so that we avoid false positives?

We use the gdata-codesearch-2.0 library to communicate with the Google Code Search engine.
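
To make question 3 more concrete, here is a minimal sketch of how a regular expression like the one in the HelloWorld example below can be derived from source code. It is not the actual apache-rat-pd generator. The names QueryRegexSketch, tokenize, escape and toRegex are hypothetical, and the tokenization rule and the set of escaped characters are assumptions inferred from that example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the actual apache-rat-pd generator.
class QueryRegexSketch {

    // Separator that appears between tokens in the example query.
    static final String SEPARATOR = "(\\s?)+";

    // Assumption: a token is a run of word characters or one non-space symbol.
    static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Assumption: regex metacharacters that get a backslash in the query.
    static final String META = "\\.[]{}()*+?^$|";

    // Split the source text into tokens, dropping all whitespace.
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Prefix every regex metacharacter in the token with a backslash.
    static String escape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Join the escaped tokens with the optional-whitespace separator.
    static String toRegex(String code) {
        List<String> escaped = new ArrayList<>();
        for (String token : tokenize(code)) {
            escaped.add(escape(token));
        }
        return String.join(SEPARATOR, escaped);
    }

    public static void main(String[] args) {
        System.out.println(toRegex("System.out.println(\"Hello World.\");"));
    }
}
```

Since (\s?)+ also matches the empty string, a pattern built this way accepts tokens with no whitespace between them (for example "publicclass"), which may be one source of the loose matches we see.
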
Example of a query for a simple HelloWorld.java:

Code:

    public class HelloWorld {
        public static void main(String argv[]) {
            System.out.println("Hello World.");
        }
    }

Query:

    http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null

[1] http://www.google.com/codesearch
[2] http://www.koders.com/
[3] http://www.krugle.com/
[4] http://en.wikipedia.org/wiki/Denial-of-service_attack

Best regards,
Marija
