Hi!
I got some answers, and I am posting them here for everyone
who is interested.
1. There is a protection mechanism against DoS attacks in the Google
Code Search service.
Increasing the wait time between consecutive queries should keep it
from being activated.
2. There is a limit on query length; it is currently 1024 characters.
3. The problem with the previous query is that Code Search does not
look for multi-line matches.
The only reason it returns results at all is that there are
spaces in the query.
So each atom is matched somewhere in the file, but not necessarily
on the same line.
To get a better result, all spaces should be escaped, e.g. with a
query something like this:
^\s*public\s*class\s*HelloWorld\s*\{\s$
^\s*public\s*static\s*void\s*main\s*\(\s*String\s*argv\[\]\)\s*\{\s$
etc.
That way, at least every complete line is sure to be matched. But
Code Search cannot tell whether the lines are next to each other.
Unfortunately, it is still not possible to get at the raw file
content with Code Search.
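
For anyone who wants to try points 1 to 3, here is a minimal Java sketch.
It is only an illustration under a few assumptions: the class and method
names (LineQuerySketch, toLineRegex, toLineQueries, runThrottled) are
made up and are not part of apache-rat-pd or gdata-codesearch, the 1024
character limit is assumed to apply to the query before URL encoding,
and the `send` callback is a placeholder for whatever code actually
talks to Code Search. Tokens are joined with \s* as in the example
above (\s+ would be stricter), and each regex ends in \s*$ so that
trailing whitespace is optional.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of apache-rat-pd or gdata-codesearch.
final class LineQuerySketch {

    // Query length limit mentioned above (assumed to apply to the unencoded query).
    static final int MAX_QUERY_LENGTH = 1024;

    private static final String REGEX_META = "\\^$.|?*+()[]{}";

    // Escapes regex metacharacters so that a token is matched literally.
    static String escape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (REGEX_META.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds a regex that must match one complete line: ^\s*tok\s*tok...\s*$
    static String toLineRegex(String sourceLine) {
        String[] tokens = sourceLine.trim().split("\\s+");
        StringBuilder sb = new StringBuilder("^\\s*");
        for (String token : tokens) {
            sb.append(escape(token)).append("\\s*");
        }
        return sb.append('$').toString();
    }

    // One regex per non-blank line; regexes over the length limit are skipped.
    static List<String> toLineQueries(String code) {
        List<String> queries = new ArrayList<>();
        for (String line : code.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String regex = toLineRegex(line);
            if (regex.length() <= MAX_QUERY_LENGTH) {
                queries.add(regex);
            }
        }
        return queries;
    }

    // Passes queries to `send` one by one, sleeping delayMillis between consecutive ones.
    static void runThrottled(List<String> queries, long delayMillis, Consumer<String> send) {
        for (int i = 0; i < queries.size(); i++) {
            if (i > 0) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return; // stop sending if the thread was interrupted
                }
            }
            send.accept(queries.get(i));
        }
    }
}
```

A call could look like
LineQuerySketch.runThrottled(LineQuerySketch.toLineQueries(code), delay, q -> ...),
where the lambda body issues the real request. I do not know which delay
is long enough to stay under the protection mechanism, so that value has
to be found by experiment.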
I would like to thank Ben for this information. :)
Best regards,
Marija
On Jul 3, 10:21 am, maka82 <[email protected]> wrote:
> After some research I found that it is not enough just to ask Google
> Code Search whether it contains a similar piece of code. False positive
> matches are very frequent, so I plan to process the whole code using the
> links from the results provided by the gdata-codesearch API. When I find
> potentially matching code in one of the results, I will do more analysis
> locally to determine whether it really is the same piece of code.
> Interestingly, the gdata-codesearch API does not provide a way to
> download the source file linked from a Code Search result [1].
> Anyway, it is possible to do that using some third-party libraries, but
> it is not an elegant solution.
>
> Best regards,
> Marija
>
> [1] http://groups.google.com/group/Google-Code-Search/browse_thread/threa...
>
> On Jun 24, 11:57 pm, maka82 <[email protected]> wrote:
>
> > Hi.
> > I am working on my project: apache-rat-pd.
> > Apache RAT plagiarism detector is a command-line tool for searching
> > the code base for possibly plagiarized code using web code search
> > engines.
> > This project is part of Google Summer of Code 2009. It is mentored
> > by Apache.
>
> > The idea is to query code search engines (like Google Code Search [1],
> > Koders [2] or Krugle [3])
> > to check if the code we send in the query is copied from somewhere.
> > More info about the project can be found
> > at http://code.google.com/p/apache-rat-pd/
>
> > Our initial plan was to make it work with Google Code Search first
> > because it is open to developers, it has custom libraries, and it has
> > great support for searching by regular expressions.
> > So far, I have created an initial version of this tool. It queries for
> > a part of the code we assume to be plagiarized. We have run into some
> > problems and need your help to resolve them.
>
> > Sometimes, when we send Google Code Search a large number of
> > queries in a short time, the engine starts rejecting our queries.
>
> > So we have some questions:
>
> > 1. Does Google Code Search have some sort of DDoS attack [4]
> > protection mechanism that we are triggering?
> > If so, how can we avoid this behaviour? What rules must we follow?
>
> > 2. Our aim is to locate plagiarised code, so we sometimes send
> > Google Code Search very long queries when the code fragment is large.
> > Is there a limit on query length?
>
> > 3. Our implementation of the regular expression generator in
> > apache-rat-pd may contain mistakes, and we sometimes get false
> > positive results when querying with our own unplagiarized code.
> > We ask Google Code Search to find a code fragment using our
> > regular expression query. Sometimes the engine returns results
> > that only partially match our code.
> > Do you have any advice on how to get only exact matches and avoid
> > false positives?
>
> > We use the gdata-codesearch-2.0 library to communicate with the
> > Google Code Search engine.
>
> > Example query for a simple HelloWorld.java:
> > Code:
> > public class HelloWorld {
> >     public static void main(String argv[]) {
> >         System.out.println("Hello World.");
> >     }
> > }
>
> > Query:
>
> > http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
>
> > [1] http://www.google.com/codesearch
> > [2] http://www.koders.com/
> > [3] http://www.krugle.com/
> > [4] http://en.wikipedia.org/wiki/Denial-of-service_attack
>
> > Best regards,
> > Marija