After some research I found that it is not enough simply to ask Google Code Search whether a similar piece of code exists. False positives are very frequent, so I plan to process the whole source file, following the links in the results returned by the gdata-codesearch API. When one of the results contains potentially matching code, I will run further analysis locally to determine whether it is really the same piece of code.

Interestingly, the gdata-codesearch API does not provide a way to download the source file linked from a code search result [1]. It is possible to do that with third-party libraries, but that is not an elegant solution.
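As a rough illustration of the local check described above, here is a minimal sketch in Java. It assumes the candidate file has already been downloaded into a string, and it compares token sequences so that differences in whitespace and line breaks do not matter. The class and method names (LocalMatchCheck, tokenize, containsSnippet) are hypothetical and are not part of apache-rat-pd or gdata-codesearch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of apache-rat-pd: checks whether a snippet
// occurs in a downloaded source file when whitespace differences are ignored.
class LocalMatchCheck {
    // A token is either a run of word characters or a single non-space symbol.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Split code into tokens, dropping all whitespace between them.
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // True if the snippet's token sequence appears contiguously in the candidate.
    static boolean containsSnippet(String candidateSource, String snippet) {
        List<String> needle = tokenize(snippet);
        if (needle.isEmpty()) {
            return false; // an empty snippet is not evidence of copying
        }
        return Collections.indexOfSubList(tokenize(candidateSource), needle) >= 0;
    }

    public static void main(String[] args) {
        // Stand-in for a file fetched from a search result link.
        String downloaded = "public class HelloWorld {\n"
                + "  public static void main( String argv[] ) {\n"
                + "    System.out.println(\"Hello World.\");\n"
                + "  }\n"
                + "}\n";
        String snippet = "public static void main(String argv[]) {"
                + " System.out.println(\"Hello World.\"); }";
        System.out.println(containsSnippet(downloaded, snippet));
    }
}
```

This only catches verbatim copies. It also ignores whitespace inside string literals, and it does nothing about renamed identifiers or reordered code, so it is a first filter rather than a full analysis.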
Best regards,
Marija

[1] http://groups.google.com/group/Google-Code-Search/browse_thread/thread/e93c701fac029a67/d7f6a97b72838e12?hl=en&lnk=gst&q=download#d7f6a97b72838e12

On Jun 24, 11:57 pm, maka82 <[email protected]> wrote:
> Hi.
> I am working on my project: apache-rat-pd.
> Apache RAT plagiarism detector is a command-line tool for searching the code
> base for possibly plagiarized code using web code search engines.
> This project is a part of Google Summer of Code 2009. It is mentored by Apache.
>
> The idea is to query code search engines (like Google Code Search [1],
> Koders [2] or Krugle [3])
> to check if the code we send in the query is copied from somewhere.
> More info about project can be found at http://code.google.com/p/apache-rat-pd/
>
> Our initial plan was to make it to work with Google Code Search first
> because it is open for developers, it has custom libraries, and has a
> great support for searching by regular expressions.
> So far, I created an initial version of this tool. It queries a part
> of code we assume to be plagiarized. We faced some problems and need
> your help to resolve them.
>
> Sometimes, when we query Google Code Search with a great number of queries in
> small time amount, the engine starts rejecting our queries.
>
> So we have some questions:
>
> 1. Do Google Code Search have some sort of DDOS attack [4] protecting
> mechanism which we activate?
> If it is true, how we can avoid this behaviour? What are the rules we must
> follow?
>
> 2. Our aim is to locate plagiarised code, so we sometimes query
> Google Code Search with very big queries if the code part is big.
> Is there some limit of query length?
>
> 3. In this implementation of regular expression generator in apache-rat-pd
> we may make some mistake and sometimes we have false positive result of our
> unplagiarized code query. We ask Google Code Search to find a code part using our
> regular expression query. Sometimes the engine returns some results which
> partially matches our code.
> Do you have some advice how to get only exact matches to avoid false
> positives?
>
> We use gdata-codesearch-2.0 library to communicate with google codesearch
> engine.
>
> Example of query for simple HelloWorld.java :
> Code:
> public class HelloWorld {
>     public static void main(String argv[]) {
>         System.out.println("Hello World.");
>     }
> }
>
> Query:
>
> http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
>
> [1] http://www.google.com/codesearch
> [2] http://www.koders.com/
> [3] http://www.krugle.com/
> [4] http://en.wikipedia.org/wiki/Denial-of-service_attack
>
> Best regards,
> Marija
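For reference, the query in the quoted message joins code tokens with (\s?)+ and escapes regular expression metacharacters. A minimal Java sketch that builds the same kind of expression is shown here. The splitting rule (whitespace plus the characters { } ( ) [ ] . ;) is an assumption inferred from the HelloWorld example, and the names QueryRegexSketch, tokenize, escape and buildQuery are hypothetical. This is not the actual apache-rat-pd generator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the apache-rat-pd generator.
class QueryRegexSketch {
    // Characters emitted as stand-alone tokens (assumption, inferred from the example).
    private static final String SEPARATORS = "{}()[].;";
    // Characters that need a backslash in the regular expression.
    private static final String META = "\\.[]{}()*+?^$|";
    // Gap placed between tokens, as in the example query.
    private static final String GAP = "(\\s?)+";

    // Split on whitespace and isolate each separator character.
    static List<String> tokenize(String code) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : code.toCharArray()) {
            boolean separator = SEPARATORS.indexOf(c) >= 0;
            if (Character.isWhitespace(c) || separator) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
                if (separator) {
                    tokens.add(String.valueOf(c));
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Prefix every regex metacharacter in the token with a backslash.
    static String escape(String token) {
        StringBuilder out = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    // Join the escaped tokens with the whitespace gap.
    static String buildQuery(String code) {
        List<String> escaped = new ArrayList<>();
        for (String token : tokenize(code)) {
            escaped.add(escape(token));
        }
        return String.join(GAP, escaped);
    }

    public static void main(String[] args) {
        String code = "public class HelloWorld { public static void main(String argv[]) {"
                + " System.out.println(\"Hello World.\"); } }";
        System.out.println(buildQuery(code));
    }
}
```

The gap (\s?)+ matches any run of zero or more whitespace characters, which is the same set of strings that \s* matches. If query length turns out to be limited, the shorter form might be worth trying, assuming the search engine accepts it.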
