[rat-cut&paste] engaging with search engine authors

Alexei Fedotov Wed, 08 Apr 2009 23:22:29 -0700

Folks, Marija,

I was thinking about proper reading. I believe it worth approaching
code search engine authors and asking them for advise. I think it is
more appropriate to approach Google first because they are known to be
open source friendly and because we are somehow competing with Krugle
- AFAIK, they have a commercial anti-plagiarism tool.


Thanks.

BTW. As for algorithm example, the sliding window is language
dependent. Thus, sliding becomes a method of a language:

while (language.slideToNextInterestingPiece(window)) {
 String content = window.content();
 if (heuristic.unique(content) > LIMIT) {





On Wed, Apr 8, 2009 at 10:17 PM, Alexei Fedotov
<[email protected]> wrote:
> Hello Marija,
> I'm impressed with your progress in learning things and prototyping a
> solution. As for your question about useful reading, there is a bunch
> of fun examples of Google Code Search usage here [1].
>
> I agree that there is no one best algorithm to approach the problem.
> My suggestion on this is to make it modular and parameterizable.
> Instead of removing common code we can just slide a window through the
> code, e.g. do something like that:
>
> IHeuristic heuristic = new DictionaryHeuristic(commonWordProbabilities);
> ILanguage language = new CPlusPlusLanguage();
>
> while (window.slide()) {
>  String content = window.content();
>  if (heuristic.unique(content) + language.independent(content) > LIMIT) {
>    ISearchResult result =
> searchEngine.search(language.replaceVarsWithWildcards(content),
> excludeUrls);
>    if (result != null) {
>      report.add(result);
>    }
>  }
> }
>
> Please, note that I'm not a fluent Java programmer, hence, you may
> discover a better way to express this algorithm. The main intention of
> this example is to show that the algorithm itself may be freed from
> language and dictionary specifics.
>
>
> BTW. Your English is ok, don't be sorry about it.
>
> [1] http://kottke.org/06/10/google-code-search
>
>
> 2009/4/8 Marija Šljivović <[email protected]>:
>> HI!
>>
>> Firstly, I apologise for my English.
>>
>> I make a small research for already known algorithms for
>> code-plagiarism detection There are several of them,
>> but I am afraid that neither one of them cannot be used in our case
>> because code for comparation is missing. I will search more about
>> that.
>>
>> Lets talk about things we know now and which I can implement.
>>
>> The greatest problem in this tool which have to be solved is that if
>> this tool is being run on large source-code, time for completely
>> processing of it will be very long. Because of that we have to make
>> logic which recognise "critical" code segments.
>>  We will try to develope our own heuristic which can recognize which
>> code is "good" for cut&paste, and then pass that code on examination
>> (of course, we must have an brute-force (will check everything) mode
>> in this tool).
>>
>> There are several ideas for heuristic which will recognise code which
>> is "good for copy and paste". As I am suggested, which is good and
>> easy to implement is: If we recognise a comment in source code which
>> is "uncommon", we can check it on search engines. Identical non-common
>> comment will be great indication of possible plagiarised code.
>>  In addition, identical comment with even small amount of "identical"
>> source code associated with them is the sign of cut&pasted code too.
>>
>> If we can somehow recognize "independent" code chunks (almost all
>> variables are in this part of code and this part of code works only
>> with this variables and returns or sets single result), like
>> functions, this code will be sent to be checked because code with
>> these properties is more often plagiarised.
>>
>> Instead of this we can do it in opposite way by removing code which is
>> common - for example in Java we use setProperty() and getProperty()
>> methods very often. If we exclude this code of our code for
>> comparation, we can send all rest to be checked.
>>
>> Code "good" for copying is very complex code with little number of
>> input variables (searching functions, sorting functions...)
>>
>> I think that just one thing is sure - there is no generic, universal
>> algorithm with which we will get good results... Well-documented
>> combination of rules that defines common copied code in combination
>> with brute-force approach can get best results.
>>
>>
>>
>> If anybody has any idea or suggestion about this tool, it will be
>> great to inform us! We together can define which functionalities will
>> have this plagiariem checker tool.
>>
>> My very basic prototype of tool is here 
>> http://b0ss.on.neobee.net/MinuliRad/RAT/
>>
>> For the end...in my prototype of this tool I use one interface -
>> ISearchEngine and all parsers for search engines must implement it. If
>> we can provide Class name of current ISearchEngine implementation to
>> main program (by argument for example), we can instantiate object of
>> that class thought reflection. Why? In that way we can separate
>> parsers from logic and parsers will be "plug ins" - I will make parser
>> for krugle.org and place it in classpath of this tool, run tool with
>> ClassName of parser plug in tool arguments and have my own
>> implementation of krugle.org code search. Somebody else can provide
>> another implementation..., all plugins will be simple .jar files in
>> RAT classpath...
>>
>>
>> Best regards,
>>
>> Marija Sljivovic
>>
>
>
>
> --
> With best regards / с наилучшими пожеланиями,
> Alexei Fedotov / Алексей Федотов,
> http://www.telecom-express.ru/
> http://people.apache.org/~aaf/
> http://harmony.apache.org/
> http://code.google.com/p/openmeetings/
>



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/

[rat-cut&paste] engaging with search engine authors

Reply via email to