After some research, I found that it is not enough just to ask Google
Code Search whether a similar piece of code exists. False positives are
very common, so I plan to process the whole source file, using the links
from the results returned by the gdata-codesearch API. When one of the
results contains potentially matching code, I will do further analysis
locally to determine whether it really is the same piece of code.
Interestingly, the gdata-codesearch API does not provide a way to
download the source file linked from a code search result [1].
It is still possible to do that using third-party libraries, but
that is not an elegant solution.
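
A minimal sketch of what such a local check could look like, assuming the
source file has already been downloaded into a string. This is not the
apache-rat-pd implementation; the class and method names are hypothetical.
It simply strips all whitespace from both sides and looks for the snippet
inside the file, so formatting differences are ignored:

```java
import java.util.regex.Pattern;

public class LocalMatchCheck {
    // Hypothetical helper, not part of apache-rat-pd or gdata-codesearch.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Remove all whitespace so that indentation and line breaks
    // do not affect the comparison.
    public static String normalize(String code) {
        return WHITESPACE.matcher(code).replaceAll("");
    }

    // True if the downloaded source contains the snippet, ignoring whitespace.
    // An empty snippet never counts as a match.
    public static boolean containsSnippet(String downloadedSource, String snippet) {
        String needle = normalize(snippet);
        return !needle.isEmpty() && normalize(downloadedSource).contains(needle);
    }

    public static void main(String[] args) {
        String source = "public class HelloWorld {\n"
                + "    public static void main(String argv[]) {\n"
                + "      System.out.println(\"Hello World.\");\n"
                + "    }\n"
                + "}\n";
        String snippet = "System.out.println( \"Hello World.\" );";
        System.out.println(containsSnippet(source, snippet));
    }
}
```

Removing all whitespace is a deliberately crude choice: it can also merge
neighbouring tokens (for example "int a" and "inta"), so a real check would
probably need to compare tokens rather than raw characters.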

Best regards,
Marija

[1]
http://groups.google.com/group/Google-Code-Search/browse_thread/thread/e93c701fac029a67/d7f6a97b72838e12?hl=en&lnk=gst&q=download#d7f6a97b72838e12


On Jun 24, 11:57 pm, maka82 <[email protected]> wrote:
> Hi.
> I am working on my project: apache-rat-pd.
> Apache RAT plagiarism detector is a command-line tool for searching
> the code base for possibly plagiarized code using web code search
> engines.
> This project is a part of Google Summer of Code 2009. It is mentored
> by Apache.
>
> The idea is to query code search engines (like Google Code Search [1],
> Koders [2] or Krugle [3])
> to check if the code we send in the query is copied from somewhere.
> More info about the project can be found at http://code.google.com/p/apache-rat-pd/
>
> Our initial plan was to make it work with Google Code Search first
> because it is open to developers, it has custom libraries, and it has
> great support for searching by regular expressions.
> So far, I have created an initial version of this tool. It queries a
> part of the code we assume to be plagiarized. We have run into some
> problems and need your help to resolve them.
>
> Sometimes, when we send Google Code Search a large number of
> queries in a short period of time, the engine starts rejecting our
> queries.
>
> So we have some questions:
>
>  1. Does Google Code Search have some sort of DDoS attack [4]
>  protection mechanism that we are triggering?
>  If so, how can we avoid this behaviour? What rules must we
>  follow?
>
>  2. Our aim is to locate plagiarised code, so we sometimes send
>  Google Code Search very long queries when the code part is large.
>  Is there a limit on query length?
>
>  3. Our regular expression generator in apache-rat-pd may contain
>  mistakes, and we sometimes get false positive results when we
>  query with code that is not plagiarized. We ask Google Code Search
>  to find a code part using our regular expression query, and
>  sometimes the engine returns results that only partially match
>  our code.
>  Do you have any advice on how to get only exact matches and avoid
>  false positives?
>
> We use the gdata-codesearch-2.0 library to communicate with the
> Google Code Search engine.
>
> Example of query for simple HelloWorld.java :
> Code:
> public class HelloWorld {
>     public static void main(String argv[]) {
>       System.out.println("Hello World.");
>     }
>  }
>
>  Query:
>
> http://www.google.com/codesearch/feeds/search?q=public(\s?)+class(\s?)+HelloWorld(\s?)+\{(\s?)+public(\s?)+static(\s?)+void(\s?)+main(\s?)+\((\s?)+String(\s?)+argv(\s?)+\[(\s?)+\](\s?)+\)(\s?)+\{(\s?)+System(\s?)+\.(\s?)+out(\s?)+\.(\s?)+println(\s?)+\((\s?)+"Hello(\s?)+World(\s?)+\.(\s?)+"(\s?)+\)(\s?)+;(\s?)+\}(\s?)+\}&start-index=null&max-results=null
>
>  [1] http://www.google.com/codesearch
>  [2] http://www.koders.com/
>  [3] http://www.krugle.com/
>  [4] http://en.wikipedia.org/wiki/Denial-of-service_attack
>
> Best regards,
> Marija
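
For reference, a whitespace-tolerant query like the HelloWorld one quoted
above could be generated roughly as follows. This is only a sketch under
my own assumptions, not the generator used in apache-rat-pd: the class
name and the tokenizing rule are hypothetical, it joins tokens with \s*
instead of the (\s?)+ used in the quoted query, and it is only checked
against java.util.regex, whose syntax may differ from what the search
engine accepts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QueryRegexSketch {
    // A token is a run of word characters or a single non-space symbol.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");
    // Characters that must be escaped to be matched literally in a regex.
    private static final String META = "\\.[]{}()*+?^$|";

    // Turn a code fragment into a regex that tolerates any whitespace
    // (including none) between neighbouring tokens.
    public static String toRegex(String code) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(code);
        while (m.find()) {
            String token = m.group();
            // Escape single-character tokens that are regex metacharacters.
            if (token.length() == 1 && META.indexOf(token.charAt(0)) >= 0) {
                token = "\\" + token;
            }
            parts.add(token);
        }
        return String.join("\\s*", parts);
    }

    public static void main(String[] args) {
        String regex = toRegex("System.out.println(\"Hello World.\");");
        System.out.println(regex);
        // The generated pattern should still match a reformatted copy.
        String reformatted = "System . out.println ( \"Hello World.\" ) ;";
        System.out.println(Pattern.compile(regex).matcher(reformatted).matches());
    }
}
```

Because whitespace between tokens is optional, the pattern also matches
text where words inside a string literal are run together, which is one
way such queries can produce partial or false positive matches.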
