Re: [rat-cut&paste] engaging with search engine authors

Marija Šljivović Mon, 13 Apr 2009 13:03:39 -0700

Hi!

I made a page about this project on Apache Wiki:
http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
I plan to post there everything about this project.


About return results type which will be returned from each heuristic algorithm:
We can use integers(1-100 for example) to measure how each algorithm
is sure about need of checking code on searchEngine, or we can use
boolean as returned type.
If we use integers we can easy define how aggressive will be used
searchEngine-s(by program parameter for example). But natural place of
this  integer value for defining if this tool will use more or less
searchEngine resources is in specific heuristicAlgorithm
implementation. We can only set it from outside.
So, I think that Alexei is correct. Booleans will be better choice if
we move some logic in IHueristicChecker Implementations.

All the best!
Marija




On Thu, Apr 9, 2009 at 6:48 PM, Alexei Fedotov <[email protected]> wrote:
>
> Rat folks,
> I've got one more question. Can you suggest anything from RAT scanning
> engine which can be reused in scanning sources for cut & paste?
>
> Thanks!
>
>
> On Thu, Apr 9, 2009 at 8:37 PM, Alexei Fedotov <[email protected]> 
> wrote:
> > Robert, all,
> > Do I understand correctly that no ACQ [3] analogues should be signed
> > by Marija to contribute to RAT? Is it true that having ICLA for
> > Harmony I can start committing her code as a separate RAT module
> > provided we obey to accepted conventions and Marija checked an ASF
> > checkbox for her code?
> >
> > [3] http://harmony.apache.org/auth_cont_quest.html
> >
> >
> >
> >
> >
> > 2009/4/9 Marija Šljivović <[email protected]>:
> >> Hi!
> >>
> >> Firstly thanks for ideas!
> >>
> >> When I read this mails I saw a great idea - all heuristic algorithms
> >> can return int value(not only boolean - good to be checked, not good
> >> to be checked). I have written pseudocode where I use this idea. In
> >> that way we don't need to make two algorithms - for bruteforce and for
> >> heuristic check. In addition we can set how aggressive will tool work
> >> by seting LIMIT.
> >> If we use booleans for checkByHueristic we can not be enough accurate
> >> when to use searchEngine.
> >> When we use integers and sum it like in bottom algorithm we can
> >> measure how it is important to check code in searchEngine.
> >> Benefits of this approach:
> >> We do not have to change core sliding window algorithm.
> >> We can later add more IHueristicChecker changing algorithm.
> >> We can choose which algorithms we will use in specific situations too.
> >>
> >> //all hueristic checkers must implements this interface
> >> interface IHueristicChecker{
> >>      /**
> >>     * @param codeToBeChecked
> >>     * @return int value from 1 to 100
> >>     */
> >>     int checkByHueristic(String codeToBeChecked);
> >>
> >> }
> >>
> >> /**
> >> * Limit is user-defined constants 1-100 and it is used to
> >> * measure how aggressively will be used search engines
> >> * if we put there 10 =little usage of internet (search engines)
> >> * if we put there 90 =almost bruteforce checking will be run
> >> */
> >> int LIMIT;
> >>
> >> //we create rules which we wish to use
> >> //this rules can be easy configured by user
> >> //by parameters or...
> >> IHueristicChecker commentsHueristicChecker = new
> >> CommentsHueristicChecker(programingLanguage...);
> >> IHueristicChecker missspelingsHueristicChecker = new
> >> MisspelingsHueristicChecker(commonLanguagePhrases...);
> >> IHueristicChecker goodFunctionToBeCopiedHueristicChecker = new
> >> GoodFunctionToBeCopiedHueristicChecker(params...);
> >>
> >> //we add all rules we ar interested in
> >> List<IHueristicChecker> algorithmsForChecking = new
> >> ArrayList<IHueristicChecker>();
> >> algorithmsForChecking.add(commentsHueristicChecker);
> >> algorithmsForChecking.add(missspelingsHueristicChecker);
> >> algorithmsForChecking.add(bruteForceChecker);
> >>
> >>
> >> ...
> >> //then in sliding window algorithm we use this checkers in part of
> >> algorihm where we
> >> //know current window content
> >>   int sum = 0;
> >>   for(IHueristicChecker checker : algorithmsForChecking)
> >>    {
> >>        sum += checker.checkByHueristic(codeToBeChecked);
> >>    }
> >>
> >> if (sum/algorithmsForChecking.size() < LIMIT)
> >> {
> >>     //we think that heuristic show us that this code must be checked
> >>     searchEngine.checkCodeOnInternet(codeToBeChecked);
> >>
> >> }
> >> ...
> >>
> >> P.S.
> >> I updated my plans on http://b0ss.on.neobee.net/MinuliRad/RAT/ ...
> >>
> >>
> >>
> >>
> >> On Thu, Apr 9, 2009 at 8:21 AM, Alexei Fedotov <[email protected]> 
> >> wrote:
> >>>
> >>> Folks, Marija,
> >>>
> >>> I was thinking about proper reading. I believe it worth approaching
> >>> code search engine authors and asking them for advise. I think it is
> >>> more appropriate to approach Google first because they are known to be
> >>> open source friendly and because we are somehow competing with Krugle
> >>> - AFAIK, they have a commercial anti-plagiarism tool.
> >>>
> >>> Thanks.
> >>>
> >>> BTW. As for algorithm example, the sliding window is language
> >>> dependent. Thus, sliding becomes a method of a language:
> >>>
> >>> while (language.slideToNextInterestingPiece(window)) {
> >>>  String content = window.content();
> >>>  if (heuristic.unique(content) > LIMIT) {
> >>>
> >>>
> >>>
> >>>
> >>>
> >>> On Wed, Apr 8, 2009 at 10:17 PM, Alexei Fedotov
> >>> <[email protected]> wrote:
> >>> > Hello Marija,
> >>> > I'm impressed with your progress in learning things and prototyping a
> >>> > solution. As for your question about useful reading, there is a bunch
> >>> > of fun examples of Google Code Search usage here [1].
> >>> >
> >>> > I agree that there is no one best algorithm to approach the problem.
> >>> > My suggestion on this is to make it modular and parameterizable.
> >>> > Instead of removing common code we can just slide a window through the
> >>> > code, e.g. do something like that:
> >>> >
> >>> > IHeuristic heuristic = new DictionaryHeuristic(commonWordProbabilities);
> >>> > ILanguage language = new CPlusPlusLanguage();
> >>> >
> >>> > while (window.slide()) {
> >>> >  String content = window.content();
> >>> >  if (heuristic.unique(content) + language.independent(content) > LIMIT) 
> >>> > {
> >>> >    ISearchResult result =
> >>> > searchEngine.search(language.replaceVarsWithWildcards(content),
> >>> > excludeUrls);
> >>> >    if (result != null) {
> >>> >      report.add(result);
> >>> >    }
> >>> >  }
> >>> > }
> >>> >
> >>> > Please, note that I'm not a fluent Java programmer, hence, you may
> >>> > discover a better way to express this algorithm. The main intention of
> >>> > this example is to show that the algorithm itself may be freed from
> >>> > language and dictionary specifics.
> >>> >
> >>> >
> >>> > BTW. Your English is ok, don't be sorry about it.
> >>> >
> >>> > [1] http://kottke.org/06/10/google-code-search
> >>> >
> >>> >
> >>> > 2009/4/8 Marija Šljivović <[email protected]>:
> >>> >> HI!
> >>> >>
> >>> >> Firstly, I apologise for my English.
> >>> >>
> >>> >> I make a small research for already known algorithms for
> >>> >> code-plagiarism detection There are several of them,
> >>> >> but I am afraid that neither one of them cannot be used in our case
> >>> >> because code for comparation is missing. I will search more about
> >>> >> that.
> >>> >>
> >>> >> Lets talk about things we know now and which I can implement.
> >>> >>
> >>> >> The greatest problem in this tool which have to be solved is that if
> >>> >> this tool is being run on large source-code, time for completely
> >>> >> processing of it will be very long. Because of that we have to make
> >>> >> logic which recognise "critical" code segments.
> >>> >>  We will try to develope our own heuristic which can recognize which
> >>> >> code is "good" for cut&paste, and then pass that code on examination
> >>> >> (of course, we must have an brute-force (will check everything) mode
> >>> >> in this tool).
> >>> >>
> >>> >> There are several ideas for heuristic which will recognise code which
> >>> >> is "good for copy and paste". As I am suggested, which is good and
> >>> >> easy to implement is: If we recognise a comment in source code which
> >>> >> is "uncommon", we can check it on search engines. Identical non-common
> >>> >> comment will be great indication of possible plagiarised code.
> >>> >>  In addition, identical comment with even small amount of "identical"
> >>> >> source code associated with them is the sign of cut&pasted code too.
> >>> >>
> >>> >> If we can somehow recognize "independent" code chunks (almost all
> >>> >> variables are in this part of code and this part of code works only
> >>> >> with this variables and returns or sets single result), like
> >>> >> functions, this code will be sent to be checked because code with
> >>> >> these properties is more often plagiarised.
> >>> >>
> >>> >> Instead of this we can do it in opposite way by removing code which is
> >>> >> common - for example in Java we use setProperty() and getProperty()
> >>> >> methods very often. If we exclude this code of our code for
> >>> >> comparation, we can send all rest to be checked.
> >>> >>
> >>> >> Code "good" for copying is very complex code with little number of
> >>> >> input variables (searching functions, sorting functions...)
> >>> >>
> >>> >> I think that just one thing is sure - there is no generic, universal
> >>> >> algorithm with which we will get good results... Well-documented
> >>> >> combination of rules that defines common copied code in combination
> >>> >> with brute-force approach can get best results.
> >>> >>
> >>> >>
> >>> >>
> >>> >> If anybody has any idea or suggestion about this tool, it will be
> >>> >> great to inform us! We together can define which functionalities will
> >>> >> have this plagiariem checker tool.
> >>> >>
> >>> >> My very basic prototype of tool is here 
> >>> >> http://b0ss.on.neobee.net/MinuliRad/RAT/
> >>> >>
> >>> >> For the end...in my prototype of this tool I use one interface -
> >>> >> ISearchEngine and all parsers for search engines must implement it. If
> >>> >> we can provide Class name of current ISearchEngine implementation to
> >>> >> main program (by argument for example), we can instantiate object of
> >>> >> that class thought reflection. Why? In that way we can separate
> >>> >> parsers from logic and parsers will be "plug ins" - I will make parser
> >>> >> for krugle.org and place it in classpath of this tool, run tool with
> >>> >> ClassName of parser plug in tool arguments and have my own
> >>> >> implementation of krugle.org code search. Somebody else can provide
> >>> >> another implementation..., all plugins will be simple .jar files in
> >>> >> RAT classpath...
> >>> >>
> >>> >>
> >>> >> Best regards,
> >>> >>
> >>> >> Marija Sljivovic
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > With best regards / с наилучшими пожеланиями,
> >>> > Alexei Fedotov / Алексей Федотов,
> >>> > http://www.telecom-express.ru/
> >>> > http://people.apache.org/~aaf/
> >>> > http://harmony.apache.org/
> >>> > http://code.google.com/p/openmeetings/
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> With best regards / с наилучшими пожеланиями,
> >>> Alexei Fedotov / Алексей Федотов,
> >>> http://www.telecom-express.ru/
> >>> http://people.apache.org/~aaf/
> >>> http://harmony.apache.org/
> >>> http://code.google.com/p/openmeetings/
> >>
> >
> >
> >
> > --
> > With best regards / с наилучшими пожеланиями,
> > Alexei Fedotov / Алексей Федотов,
> > http://www.telecom-express.ru/
> > http://people.apache.org/~aaf/
> > http://harmony.apache.org/
> > http://code.google.com/p/openmeetings/
> >
>
>
>
> --
> With best regards / с наилучшими пожеланиями,
> Alexei Fedotov / Алексей Федотов,
> http://www.telecom-express.ru/
> http://people.apache.org/~aaf/
> http://harmony.apache.org/
> http://code.google.com/p/openmeetings/

Re: [rat-cut&paste] engaging with search engine authors

Reply via email to