Great news! I join Marija in thanking Ben!


On Sat, Jul 4, 2009 at 3:33 AM, maka82<[email protected]> wrote:
> Hi!
>
> I got some answers and I will write them here for everyone
> who is interested.
>
> 1. The Google Code Search service does have a protection mechanism
> against DoS attacks.
>   Increasing the wait time between consecutive queries should keep
> it from being triggered.
>
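For point 1, here is a minimal sketch of spacing queries out on the client side. The thread gives no concrete interval, so the class name QueryPacer and any interval you pass in are assumptions for illustration, not values confirmed by Google.

```java
/** Hypothetical helper: enforces a minimum pause between two queries. */
final class QueryPacer {
    private final long minIntervalMillis;
    private long lastSentMillis;
    private boolean sentAny = false;

    public QueryPacer(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /** Milliseconds still to wait at time nowMillis before sending again. */
    public long delayBeforeNext(long nowMillis) {
        if (!sentAny) {
            return 0L; // nothing sent yet, so no pause is needed
        }
        long elapsed = nowMillis - lastSentMillis;
        return Math.max(0L, minIntervalMillis - elapsed);
    }

    /** Records that a query was sent at time nowMillis. */
    public void markSent(long nowMillis) {
        lastSentMillis = nowMillis;
        sentAny = true;
    }

    /** Blocks until the minimum interval has passed, then records the send. */
    public void awaitTurn() throws InterruptedException {
        long delay = delayBeforeNext(System.currentTimeMillis());
        if (delay > 0) {
            Thread.sleep(delay);
        }
        markSent(System.currentTimeMillis());
    }
}
```

Calling awaitTurn() right before each request would be enough for a single-threaded client. If queries are still rejected, the interval can be raised; the right value has to be found by experiment.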
> 2. There is a query length limit; it is currently 1024 characters.
>
> 3. The problem with the previous query is that Code Search does not
> look for multi-line matches.
>    It returns results at all only because there are spaces in the
> query, so each atom is matched somewhere in the file, but not
> necessarily on the same line.
>    To get a better result, all spaces should be escaped, e.g. with
> a query something like this:
>
>    ^\s*public\s*class\s*HelloWorld\s*\{\s$
>    ^\s*public\s*static\s*void\s*main\s*\(\s*String\s*argv\[\]\)\s*\{\s$
>    etc.
>
>  That way it is at least certain that every complete line is matched.
> But Code Search cannot tell whether the lines are next to each other.
>
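For points 2 and 3, one possible way to turn a snippet into anchored one-line queries is sketched below. This is only an illustration: the class and method names are made up, whitespace runs are replaced with \s+ (a stricter choice than the \s* in the example patterns above), and the 1024 limit is assumed to apply to the unencoded query text, which the thread does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: one anchored regex query per source line. */
final class LineQueryBuilder {
    // Limit quoted in the thread; assumed to count unencoded characters.
    static final int MAX_QUERY_LENGTH = 1024;

    /** Backslash-escapes regex metacharacters in a literal token. */
    public static String escape(String token) {
        StringBuilder sb = new StringBuilder();
        for (char c : token.toCharArray()) {
            if ("\\.[]{}()*+?^$|".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Builds an anchored pattern for one line, or null for a blank line. */
    public static String toLineRegex(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] tokens = trimmed.split("\\s+");
        StringBuilder sb = new StringBuilder("^\\s*");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                sb.append("\\s+"); // no literal space is left in the query
            }
            sb.append(escape(tokens[i]));
        }
        return sb.append("\\s*$").toString();
    }

    /** One query per non-blank line; over-long patterns are skipped. */
    public static List<String> toLineQueries(String code) {
        List<String> queries = new ArrayList<>();
        for (String line : code.split("\n")) {
            String query = toLineRegex(line);
            if (query != null && query.length() <= MAX_QUERY_LENGTH) {
                queries.add(query);
            }
        }
        return queries;
    }
}
```

Each returned string would still need URL encoding before it goes into the q= parameter. As noted above, a hit for every line does not prove the lines are adjacent in the matched file.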
>  Unfortunately, it is still not possible to get at the raw file
> content with Code Search.
>
>  I would like to thank Ben for this information. :)
>
> Best regards,
> Marija
>
>
> On Jul 3, 10:21 am, maka82 <[email protected]> wrote:
>> After some research I found that it is not enough just to ask Google
>> Code Search whether a similar piece of code exists. False positives
>> are very frequent, so I plan to process the whole source file using
>> the links from the results provided by the gdata-codesearch API. When
>> I find potentially matching code in one of the results, I will do
>> more analysis locally to determine whether it really is the same
>> piece of code.
>> Interestingly, the gdata-codesearch API does not provide a way to
>> download the source file linked from a Code Search result [1].
>> Anyway, it is possible to do that using some third-party libraries,
>> but it is not an elegant solution.
>>
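The local analysis described here could start with a check that the snippet's lines occur next to each other in the downloaded file. The sketch below assumes the file text has already been obtained by some other means; the class and method names are hypothetical and this is not the actual apache-rat-pd code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: local adjacency check for a candidate match. */
final class LocalMatchCheck {
    /** Removes all whitespace so formatting differences are ignored. */
    public static String normalize(String line) {
        return line.replaceAll("\\s+", "");
    }

    /** Normalized, non-blank lines of a text, in their original order. */
    public static List<String> significantLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\n")) {
            String normalized = normalize(line);
            if (!normalized.isEmpty()) {
                lines.add(normalized);
            }
        }
        return lines;
    }

    /** True if the snippet's lines form one consecutive run in the file. */
    public static boolean containsConsecutively(String fileText, String snippet) {
        List<String> file = significantLines(fileText);
        List<String> wanted = significantLines(snippet);
        if (wanted.isEmpty()) {
            return false; // an empty snippet proves nothing
        }
        return Collections.indexOfSubList(file, wanted) >= 0;
    }
}
```

Blank lines are dropped before comparing, so adjacency here means adjacent among non-blank lines. Stripping all whitespace is a deliberately crude normalization; a real check would probably compare tokens instead.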
>> Best regards,
>> Marija
>>
>> [1]http://groups.google.com/group/Google-Code-Search/browse_thread/threa...
>>
>> On Jun 24, 11:57 pm, maka82 <[email protected]> wrote:
>>
>>
>>
>> > Hi.
>> > I am working on my project: apache-rat-pd.
>> > Apache RAT plagiarism detector is a command-line tool for searching
>> > a code base for possibly plagiarized code using web code search
>> > engines.
>> > This project is part of Google Summer of Code 2009 and is mentored
>> > by Apache.
>>
>> > The idea is to query code search engines (like Google Code Search [1],
>> > Koders [2] or Krugle [3])
>> > to check whether the code we send in the query was copied from
>> > somewhere.
>> > More info about the project can be found at
>> > http://code.google.com/p/apache-rat-pd/
>>
>> > Our initial plan was to make it work with Google Code Search first,
>> > because it is open to developers, has custom libraries, and has
>> > great support for searching by regular expressions.
>> > So far, I have created an initial version of this tool. It queries
>> > for a piece of code we assume to be plagiarized. We ran into some
>> > problems and need your help to resolve them.
>>
>> > Sometimes, when we send Google Code Search a large number of
>> > queries in a short period of time, the engine starts rejecting our
>> > queries.
>>
>> > So we have some questions:
>>
>> >  1. Does Google Code Search have some sort of DDoS attack [4]
>> >  protection mechanism that we are triggering?
>> >  If so, how can we avoid this behaviour? What rules must we follow?
>>
>> >  2. Our aim is to locate plagiarised code, so we sometimes send
>> > Google Code Search very long queries when the code fragment is large.
>> > Is there a limit on query length?
>>
>> > 3. The regular expression generator in apache-rat-pd may contain
>> > mistakes, and we sometimes get false positive results when querying
>> > for our own unplagiarized code. We ask Google Code Search to find a
>> > code fragment using our regular expression query. Sometimes the
>> > engine returns results which only partially match our code.
>> > Do you have any advice on how to get only exact matches, to avoid
>> > false positives?
>>
>> > We use the gdata-codesearch-2.0 library to communicate with the
>> > Google Code Search engine.
>>
>> > Example of query for simple HelloWorld.java :
>> > Code:
>> > public class HelloWorld {
>> >     public static void main(String argv[]) {
>> >       System.out.println("Hello World.");
>> >     }
>> >  }
>>
>> >  Query:
>>
>> > http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
>>
>> >  [1]http://www.google.com/codesearch
>> >  [2]http://www.koders.com/
>> >  [3]http://www.krugle.com/
>> >  [4]  http://en.wikipedia.org/wiki/Denial-of-service_attack
>>
>> > Best regards,
>> > Marija
>



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/
