Great news! I join Marija in thanking Ben!
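
On point 3 of Marija's summary quoted below (escape the spaces so that every complete line is matched), here is a minimal sketch of how a per-line query could be generated. It is only an illustration under assumptions, not the apache-rat-pd implementation: the class and method names (LineQueryBuilder, toLineRegex, buildQueries) are hypothetical, the set of escaped characters is a guess, the trailing \s*$ is my choice, and I assume the 1024-character limit applies to each regex string.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: one anchored, whitespace-tolerant regex per source line. */
final class LineQueryBuilder {

    /** Query length limit (in characters) reported in this thread. */
    static final int MAX_QUERY_LENGTH = 1024;

    /** Regex metacharacters that get a backslash; an assumed, conservative set. */
    private static final String META = "\\.[]{}()*+?^$|";

    /** Escapes regex metacharacters in a single whitespace-free token. */
    static String escape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Anchors the pattern with ^ and $ so it can only match within one line. */
    static String toLineRegex(String line) {
        String[] tokens = line.trim().split("\\s+");
        StringBuilder sb = new StringBuilder("^\\s*");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append("\\s*"); // replaces the literal whitespace between tokens
            }
            sb.append(escape(tokens[i]));
        }
        return sb.append("\\s*$").toString();
    }

    /** Builds one query per non-blank line and rejects over-long queries. */
    static List<String> buildQueries(String source) {
        List<String> queries = new ArrayList<>();
        for (String line : source.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // blank lines carry no information for matching
            }
            String regex = toLineRegex(line);
            if (regex.length() > MAX_QUERY_LENGTH) {
                throw new IllegalArgumentException(
                        "query exceeds " + MAX_QUERY_LENGTH + " characters");
            }
            queries.add(regex);
        }
        return queries;
    }

    public static void main(String[] args) {
        String code = "public class HelloWorld {\n"
                + "    public static void main(String argv[]) {\n"
                + "    }\n"
                + "}\n";
        for (String query : buildQueries(code)) {
            System.out.println(query);
        }
    }
}
```

Note that this sketch only adds \s* where the source line already had whitespace, so it is stricter than the hand-written patterns in the quoted message, which also allow whitespace around punctuation such as "(". As Marija says, matching each line separately still cannot tell whether the matched lines are adjacent in the file.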
On Sat, Jul 4, 2009 at 3:33 AM, maka82<[email protected]> wrote:
> Hi!
>
> I got some answers and I will write them here for everyone who is
> interested.
>
> 1. The Google Code Search service has a protection mechanism against
>    DoS attacks. Increasing the wait time between consecutive queries
>    should keep it from being activated.
>
> 2. There is a query length limit; it is currently 1024 characters.
>
> 3. The problem with the previous query is that Code Search does not
>    look for multi-line matches. It returns results at all only because
>    there are spaces in the query, so each atom is matched somewhere in
>    the file, but not necessarily on the same line. To get a better
>    result, all spaces should be escaped, e.g. query something like
>    this:
>
> ^\s*public\s*class\s*HelloWorld\s*\{\s$
> ^\s*public\s*static\s*void\s*main\s*\(\s*String\s*argv\[\]\)\s*\{\s$
> etc.
>
> That way it is certain at least that every complete line is matched.
> But whether the lines are next to each other, Code Search cannot tell.
>
> Unfortunately, it is still not possible to get at the raw file
> content with Code Search.
>
> I would like to thank Ben for this information. :)
>
> Best regards,
> Marija
>
>
> On Jul 3, 10:21 am, maka82 <[email protected]> wrote:
>> After some research I found that it is not enough just to ask Google
>> Code Search whether a similar piece of code exists. False positives
>> are very frequent, so I plan to process the whole code using the
>> links from the results provided by the gdata-codesearch API. When I
>> find potentially matching code in one of the results, I will do more
>> analysis locally to determine whether it is really the same piece of
>> code.
>> Interestingly, the gdata-codesearch API does not provide a way to
>> download the source file linked by a Code Search result [1].
>> Anyway, it is possible to do that using some third-party libraries,
>> but it is not an elegant solution.
>>
>> Best regards,
>> Marija
>>
>> [1] http://groups.google.com/group/Google-Code-Search/browse_thread/threa...
>>
>> On Jun 24, 11:57 pm, maka82 <[email protected]> wrote:
>>
>> > Hi.
>> > I am working on my project, apache-rat-pd.
>> > Apache RAT plagiarism detector is a command-line tool for searching
>> > the code base for possibly plagiarized code using web code search
>> > engines. This project is part of Google Summer of Code 2009 and is
>> > mentored by Apache.
>> >
>> > The idea is to query code search engines (like Google Code Search
>> > [1], Koders [2] or Krugle [3]) to check whether the code we send in
>> > the query was copied from somewhere.
>> > More info about the project can be found at
>> > http://code.google.com/p/apache-rat-pd/
>> >
>> > Our initial plan was to make it work with Google Code Search first,
>> > because it is open to developers, has custom libraries, and has
>> > great support for searching by regular expressions.
>> > So far, I have created an initial version of this tool. It queries
>> > for a part of the code we assume to be plagiarized. We have run
>> > into some problems and need your help to resolve them.
>> >
>> > Sometimes, when we send Google Code Search a large number of
>> > queries in a short time, the engine starts rejecting our queries.
>> >
>> > So we have some questions:
>> >
>> > 1. Does Google Code Search have some sort of DDoS attack [4]
>> > protection mechanism that we are activating?
>> > If so, how can we avoid this behaviour? What rules must we follow?
>> >
>> > 2. Our aim is to locate plagiarised code, so we sometimes send
>> > Google Code Search very long queries when the code part is big.
>> > Is there a limit on query length?
>> >
>> > 3. The regular expression generator in apache-rat-pd may contain
>> > mistakes, and sometimes we get false positive results when querying
>> > for our own unplagiarized code. We ask Google Code Search to find a
>> > code part using our regular expression query. Sometimes the engine
>> > returns results which only partially match our code.
>> > Do you have any advice on how to get only exact matches and avoid
>> > false positives?
>> >
>> > We use the gdata-codesearch-2.0 library to communicate with the
>> > Google Code Search engine.
>> >
>> > Example of a query for a simple HelloWorld.java:
>> > Code:
>> > public class HelloWorld {
>> >     public static void main(String argv[]) {
>> >         System.out.println("Hello World.");
>> >     }
>> > }
>> >
>> > Query:
>> >
>> > http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
>> >
>> > [1] http://www.google.com/codesearch
>> > [2] http://www.koders.com/
>> > [3] http://www.krugle.com/
>> > [4] http://en.wikipedia.org/wiki/Denial-of-service_attack
>> >
>> > Best regards,
>> > Marija
>

--
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/
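
P.S. On point 1 above (spacing the queries out so the DoS protection is not triggered), a simple client-side throttle might look like the sketch below. Everything in it is an assumption for illustration: the class name QueryThrottle and its methods are hypothetical, and the thread does not say which interval is actually safe, so the value has to be found by experiment.

```java
/** Hypothetical throttle: enforces a minimum interval between outgoing queries. */
final class QueryThrottle {

    private final long minIntervalMillis;
    private long nextAllowedAt = 0L; // earliest time the next query may be sent

    QueryThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("interval must not be negative");
        }
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Reserves the next sending slot and returns how many milliseconds
     * the caller has to wait before using it.
     */
    synchronized long reserve(long nowMillis) {
        long wait = Math.max(0L, nextAllowedAt - nowMillis);
        nextAllowedAt = nowMillis + wait + minIntervalMillis;
        return wait;
    }

    /** Blocks until the reserved slot is reached; call this before each query. */
    void await() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        QueryThrottle throttle = new QueryThrottle(500L); // assumed interval, tune by experiment
        for (int i = 0; i < 3; i++) {
            throttle.await();
            System.out.println("sending query " + i);
        }
    }
}
```

The time is passed into reserve() explicitly so that the spacing logic can be checked without actually sleeping; await() is the method a query loop would call.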
