Re: [rat-cut&paste] engaging with search engine authors

Alexei Fedotov Thu, 09 Apr 2009 09:48:41 -0700

Rat folks,
I've got one more question. Can you suggest anything from RAT scanning
engine which can be reused in scanning sources for cut & paste?


Thanks!


On Thu, Apr 9, 2009 at 8:37 PM, Alexei Fedotov <[email protected]> wrote:
> Robert, all,
> Do I understand correctly that no ACQ [3] analogues should be signed
> by Marija to contribute to RAT? Is it true that having ICLA for
> Harmony I can start committing her code as a separate RAT module
> provided we obey to accepted conventions and Marija checked an ASF
> checkbox for her code?
>
> [3] http://harmony.apache.org/auth_cont_quest.html
>
>
>
>
>
> 2009/4/9 Marija Šljivović <[email protected]>:
>> Hi!
>>
>> Firstly thanks for ideas!
>>
>> When I read this mails I saw a great idea - all heuristic algorithms
>> can return int value(not only boolean - good to be checked, not good
>> to be checked). I have written pseudocode where I use this idea. In
>> that way we don't need to make two algorithms - for bruteforce and for
>> heuristic check. In addition we can set how aggressive will tool work
>> by seting LIMIT.
>> If we use booleans for checkByHueristic we can not be enough accurate
>> when to use searchEngine.
>> When we use integers and sum it like in bottom algorithm we can
>> measure how it is important to check code in searchEngine.
>> Benefits of this approach:
>> We do not have to change core sliding window algorithm.
>> We can later add more IHueristicChecker changing algorithm.
>> We can choose which algorithms we will use in specific situations too.
>>
>> //all hueristic checkers must implements this interface
>> interface IHueristicChecker{
>>      /**
>>     * @param codeToBeChecked
>>     * @return int value from 1 to 100
>>     */
>>     int checkByHueristic(String codeToBeChecked);
>>
>> }
>>
>> /**
>> * Limit is user-defined constants 1-100 and it is used to
>> * measure how aggressively will be used search engines
>> * if we put there 10 =little usage of internet (search engines)
>> * if we put there 90 =almost bruteforce checking will be run
>> */
>> int LIMIT;
>>
>> //we create rules which we wish to use
>> //this rules can be easy configured by user
>> //by parameters or...
>> IHueristicChecker commentsHueristicChecker = new
>> CommentsHueristicChecker(programingLanguage...);
>> IHueristicChecker missspelingsHueristicChecker = new
>> MisspelingsHueristicChecker(commonLanguagePhrases...);
>> IHueristicChecker goodFunctionToBeCopiedHueristicChecker = new
>> GoodFunctionToBeCopiedHueristicChecker(params...);
>>
>> //we add all rules we ar interested in
>> List<IHueristicChecker> algorithmsForChecking = new
>> ArrayList<IHueristicChecker>();
>> algorithmsForChecking.add(commentsHueristicChecker);
>> algorithmsForChecking.add(missspelingsHueristicChecker);
>> algorithmsForChecking.add(bruteForceChecker);
>>
>>
>> ...
>> //then in sliding window algorithm we use this checkers in part of
>> algorihm where we
>> //know current window content
>>   int sum = 0;
>>   for(IHueristicChecker checker : algorithmsForChecking)
>>    {
>>        sum += checker.checkByHueristic(codeToBeChecked);
>>    }
>>
>> if (sum/algorithmsForChecking.size() < LIMIT)
>> {
>>     //we think that heuristic show us that this code must be checked
>>     searchEngine.checkCodeOnInternet(codeToBeChecked);
>>
>> }
>> ...
>>
>> P.S.
>> I updated my plans on http://b0ss.on.neobee.net/MinuliRad/RAT/ ...
>>
>>
>>
>>
>> On Thu, Apr 9, 2009 at 8:21 AM, Alexei Fedotov <[email protected]> 
>> wrote:
>>>
>>> Folks, Marija,
>>>
>>> I was thinking about proper reading. I believe it worth approaching
>>> code search engine authors and asking them for advise. I think it is
>>> more appropriate to approach Google first because they are known to be
>>> open source friendly and because we are somehow competing with Krugle
>>> - AFAIK, they have a commercial anti-plagiarism tool.
>>>
>>> Thanks.
>>>
>>> BTW. As for algorithm example, the sliding window is language
>>> dependent. Thus, sliding becomes a method of a language:
>>>
>>> while (language.slideToNextInterestingPiece(window)) {
>>>  String content = window.content();
>>>  if (heuristic.unique(content) > LIMIT) {
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Apr 8, 2009 at 10:17 PM, Alexei Fedotov
>>> <[email protected]> wrote:
>>> > Hello Marija,
>>> > I'm impressed with your progress in learning things and prototyping a
>>> > solution. As for your question about useful reading, there is a bunch
>>> > of fun examples of Google Code Search usage here [1].
>>> >
>>> > I agree that there is no one best algorithm to approach the problem.
>>> > My suggestion on this is to make it modular and parameterizable.
>>> > Instead of removing common code we can just slide a window through the
>>> > code, e.g. do something like that:
>>> >
>>> > IHeuristic heuristic = new DictionaryHeuristic(commonWordProbabilities);
>>> > ILanguage language = new CPlusPlusLanguage();
>>> >
>>> > while (window.slide()) {
>>> >  String content = window.content();
>>> >  if (heuristic.unique(content) + language.independent(content) > LIMIT) {
>>> >    ISearchResult result =
>>> > searchEngine.search(language.replaceVarsWithWildcards(content),
>>> > excludeUrls);
>>> >    if (result != null) {
>>> >      report.add(result);
>>> >    }
>>> >  }
>>> > }
>>> >
>>> > Please, note that I'm not a fluent Java programmer, hence, you may
>>> > discover a better way to express this algorithm. The main intention of
>>> > this example is to show that the algorithm itself may be freed from
>>> > language and dictionary specifics.
>>> >
>>> >
>>> > BTW. Your English is ok, don't be sorry about it.
>>> >
>>> > [1] http://kottke.org/06/10/google-code-search
>>> >
>>> >
>>> > 2009/4/8 Marija Šljivović <[email protected]>:
>>> >> HI!
>>> >>
>>> >> Firstly, I apologise for my English.
>>> >>
>>> >> I make a small research for already known algorithms for
>>> >> code-plagiarism detection There are several of them,
>>> >> but I am afraid that neither one of them cannot be used in our case
>>> >> because code for comparation is missing. I will search more about
>>> >> that.
>>> >>
>>> >> Lets talk about things we know now and which I can implement.
>>> >>
>>> >> The greatest problem in this tool which have to be solved is that if
>>> >> this tool is being run on large source-code, time for completely
>>> >> processing of it will be very long. Because of that we have to make
>>> >> logic which recognise "critical" code segments.
>>> >>  We will try to develope our own heuristic which can recognize which
>>> >> code is "good" for cut&paste, and then pass that code on examination
>>> >> (of course, we must have an brute-force (will check everything) mode
>>> >> in this tool).
>>> >>
>>> >> There are several ideas for heuristic which will recognise code which
>>> >> is "good for copy and paste". As I am suggested, which is good and
>>> >> easy to implement is: If we recognise a comment in source code which
>>> >> is "uncommon", we can check it on search engines. Identical non-common
>>> >> comment will be great indication of possible plagiarised code.
>>> >>  In addition, identical comment with even small amount of "identical"
>>> >> source code associated with them is the sign of cut&pasted code too.
>>> >>
>>> >> If we can somehow recognize "independent" code chunks (almost all
>>> >> variables are in this part of code and this part of code works only
>>> >> with this variables and returns or sets single result), like
>>> >> functions, this code will be sent to be checked because code with
>>> >> these properties is more often plagiarised.
>>> >>
>>> >> Instead of this we can do it in opposite way by removing code which is
>>> >> common - for example in Java we use setProperty() and getProperty()
>>> >> methods very often. If we exclude this code of our code for
>>> >> comparation, we can send all rest to be checked.
>>> >>
>>> >> Code "good" for copying is very complex code with little number of
>>> >> input variables (searching functions, sorting functions...)
>>> >>
>>> >> I think that just one thing is sure - there is no generic, universal
>>> >> algorithm with which we will get good results... Well-documented
>>> >> combination of rules that defines common copied code in combination
>>> >> with brute-force approach can get best results.
>>> >>
>>> >>
>>> >>
>>> >> If anybody has any idea or suggestion about this tool, it will be
>>> >> great to inform us! We together can define which functionalities will
>>> >> have this plagiariem checker tool.
>>> >>
>>> >> My very basic prototype of tool is here 
>>> >> http://b0ss.on.neobee.net/MinuliRad/RAT/
>>> >>
>>> >> For the end...in my prototype of this tool I use one interface -
>>> >> ISearchEngine and all parsers for search engines must implement it. If
>>> >> we can provide Class name of current ISearchEngine implementation to
>>> >> main program (by argument for example), we can instantiate object of
>>> >> that class thought reflection. Why? In that way we can separate
>>> >> parsers from logic and parsers will be "plug ins" - I will make parser
>>> >> for krugle.org and place it in classpath of this tool, run tool with
>>> >> ClassName of parser plug in tool arguments and have my own
>>> >> implementation of krugle.org code search. Somebody else can provide
>>> >> another implementation..., all plugins will be simple .jar files in
>>> >> RAT classpath...
>>> >>
>>> >>
>>> >> Best regards,
>>> >>
>>> >> Marija Sljivovic
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > With best regards / с наилучшими пожеланиями,
>>> > Alexei Fedotov / Алексей Федотов,
>>> > http://www.telecom-express.ru/
>>> > http://people.apache.org/~aaf/
>>> > http://harmony.apache.org/
>>> > http://code.google.com/p/openmeetings/
>>> >
>>>
>>>
>>>
>>> --
>>> With best regards / с наилучшими пожеланиями,
>>> Alexei Fedotov / Алексей Федотов,
>>> http://www.telecom-express.ru/
>>> http://people.apache.org/~aaf/
>>> http://harmony.apache.org/
>>> http://code.google.com/p/openmeetings/
>>
>
>
>
> --
> With best regards / с наилучшими пожеланиями,
> Alexei Fedotov / Алексей Федотов,
> http://www.telecom-express.ru/
> http://people.apache.org/~aaf/
> http://harmony.apache.org/
> http://code.google.com/p/openmeetings/
>



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/

Re: [rat-cut&paste] engaging with search engine authors

Reply via email to