Re: [rat-cut&paste] engaging with search engine authors

Alexei Fedotov Tue, 14 Apr 2009 09:14:37 -0700

Marija,
I like your plan, and I like that you get acquainted with Apache
infrastructure. The next step is to submit your prototype to JIRA.
When you sign an ASF grant using JIRA checkbox, I start reviewing and
committing your code.


As for return type for heuristics, I don't think I'm 100% correct.
Sorry for being contradictory. There are some metrics, for example, a
number of pages shown by a search engine for a given phrase, which are
naturally expressed in integers. This should not stop you from
settling to a variant you like more and developing inter-component
interfaces.

I find this discussion useful, because we come to a definition of high
level design components: languages, heuristics, configuration managers
(e.g. command line or property parsers) and search engine wrappers.
What else have I missed in this list considering a minimal set of
modules?

Best regards!

2009/4/14 Marija Šljivović <[email protected]>:
> Hi!
>
> I made a page about this project on Apache Wiki:
> http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
> I plan to post there everything about this project.
>
> About return results type which will be returned from each heuristic 
> algorithm:
> We can use integers(1-100 for example) to measure how each algorithm
> is sure about need of checking code on searchEngine, or we can use
> boolean as returned type.
> If we use integers we can easy define how aggressive will be used
> searchEngine-s(by program parameter for example). But natural place of
> this  integer value for defining if this tool will use more or less
> searchEngine resources is in specific heuristicAlgorithm
> implementation. We can only set it from outside.
> So, I think that Alexei is correct. Booleans will be better choice if
> we move some logic in IHueristicChecker Implementations.
>
> All the best!
> Marija
>
>
>
>
> On Thu, Apr 9, 2009 at 6:48 PM, Alexei Fedotov <[email protected]> 
> wrote:
>>
>> Rat folks,
>> I've got one more question. Can you suggest anything from RAT scanning
>> engine which can be reused in scanning sources for cut & paste?
>>
>> Thanks!
>>
>>
>> On Thu, Apr 9, 2009 at 8:37 PM, Alexei Fedotov <[email protected]> 
>> wrote:
>> > Robert, all,
>> > Do I understand correctly that no ACQ [3] analogues should be signed
>> > by Marija to contribute to RAT? Is it true that having ICLA for
>> > Harmony I can start committing her code as a separate RAT module
>> > provided we obey to accepted conventions and Marija checked an ASF
>> > checkbox for her code?
>> >
>> > [3] http://harmony.apache.org/auth_cont_quest.html
>> >
>> >
>> >
>> >
>> >
>> > 2009/4/9 Marija Šljivović <[email protected]>:
>> >> Hi!
>> >>
>> >> Firstly thanks for ideas!
>> >>
>> >> When I read this mails I saw a great idea - all heuristic algorithms
>> >> can return int value(not only boolean - good to be checked, not good
>> >> to be checked). I have written pseudocode where I use this idea. In
>> >> that way we don't need to make two algorithms - for bruteforce and for
>> >> heuristic check. In addition we can set how aggressive will tool work
>> >> by seting LIMIT.
>> >> If we use booleans for checkByHueristic we can not be enough accurate
>> >> when to use searchEngine.
>> >> When we use integers and sum it like in bottom algorithm we can
>> >> measure how it is important to check code in searchEngine.
>> >> Benefits of this approach:
>> >> We do not have to change core sliding window algorithm.
>> >> We can later add more IHueristicChecker changing algorithm.
>> >> We can choose which algorithms we will use in specific situations too.
>> >>
>> >> //all hueristic checkers must implements this interface
>> >> interface IHueristicChecker{
>> >>      /**
>> >>     * @param codeToBeChecked
>> >>     * @return int value from 1 to 100
>> >>     */
>> >>     int checkByHueristic(String codeToBeChecked);
>> >>
>> >> }
>> >>
>> >> /**
>> >> * Limit is user-defined constants 1-100 and it is used to
>> >> * measure how aggressively will be used search engines
>> >> * if we put there 10 =little usage of internet (search engines)
>> >> * if we put there 90 =almost bruteforce checking will be run
>> >> */
>> >> int LIMIT;
>> >>
>> >> //we create rules which we wish to use
>> >> //this rules can be easy configured by user
>> >> //by parameters or...
>> >> IHueristicChecker commentsHueristicChecker = new
>> >> CommentsHueristicChecker(programingLanguage...);
>> >> IHueristicChecker missspelingsHueristicChecker = new
>> >> MisspelingsHueristicChecker(commonLanguagePhrases...);
>> >> IHueristicChecker goodFunctionToBeCopiedHueristicChecker = new
>> >> GoodFunctionToBeCopiedHueristicChecker(params...);
>> >>
>> >> //we add all rules we ar interested in
>> >> List<IHueristicChecker> algorithmsForChecking = new
>> >> ArrayList<IHueristicChecker>();
>> >> algorithmsForChecking.add(commentsHueristicChecker);
>> >> algorithmsForChecking.add(missspelingsHueristicChecker);
>> >> algorithmsForChecking.add(bruteForceChecker);
>> >>
>> >>
>> >> ...
>> >> //then in sliding window algorithm we use this checkers in part of
>> >> algorihm where we
>> >> //know current window content
>> >>   int sum = 0;
>> >>   for(IHueristicChecker checker : algorithmsForChecking)
>> >>    {
>> >>        sum += checker.checkByHueristic(codeToBeChecked);
>> >>    }
>> >>
>> >> if (sum/algorithmsForChecking.size() < LIMIT)
>> >> {
>> >>     //we think that heuristic show us that this code must be checked
>> >>     searchEngine.checkCodeOnInternet(codeToBeChecked);
>> >>
>> >> }
>> >> ...
>> >>
>> >> P.S.
>> >> I updated my plans on http://b0ss.on.neobee.net/MinuliRad/RAT/ ...
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Apr 9, 2009 at 8:21 AM, Alexei Fedotov <[email protected]> 
>> >> wrote:
>> >>>
>> >>> Folks, Marija,
>> >>>
>> >>> I was thinking about proper reading. I believe it worth approaching
>> >>> code search engine authors and asking them for advise. I think it is
>> >>> more appropriate to approach Google first because they are known to be
>> >>> open source friendly and because we are somehow competing with Krugle
>> >>> - AFAIK, they have a commercial anti-plagiarism tool.
>> >>>
>> >>> Thanks.
>> >>>
>> >>> BTW. As for algorithm example, the sliding window is language
>> >>> dependent. Thus, sliding becomes a method of a language:
>> >>>
>> >>> while (language.slideToNextInterestingPiece(window)) {
>> >>>  String content = window.content();
>> >>>  if (heuristic.unique(content) > LIMIT) {
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> On Wed, Apr 8, 2009 at 10:17 PM, Alexei Fedotov
>> >>> <[email protected]> wrote:
>> >>> > Hello Marija,
>> >>> > I'm impressed with your progress in learning things and prototyping a
>> >>> > solution. As for your question about useful reading, there is a bunch
>> >>> > of fun examples of Google Code Search usage here [1].
>> >>> >
>> >>> > I agree that there is no one best algorithm to approach the problem.
>> >>> > My suggestion on this is to make it modular and parameterizable.
>> >>> > Instead of removing common code we can just slide a window through the
>> >>> > code, e.g. do something like that:
>> >>> >
>> >>> > IHeuristic heuristic = new 
>> >>> > DictionaryHeuristic(commonWordProbabilities);
>> >>> > ILanguage language = new CPlusPlusLanguage();
>> >>> >
>> >>> > while (window.slide()) {
>> >>> >  String content = window.content();
>> >>> >  if (heuristic.unique(content) + language.independent(content) > 
>> >>> > LIMIT) {
>> >>> >    ISearchResult result =
>> >>> > searchEngine.search(language.replaceVarsWithWildcards(content),
>> >>> > excludeUrls);
>> >>> >    if (result != null) {
>> >>> >      report.add(result);
>> >>> >    }
>> >>> >  }
>> >>> > }
>> >>> >
>> >>> > Please, note that I'm not a fluent Java programmer, hence, you may
>> >>> > discover a better way to express this algorithm. The main intention of
>> >>> > this example is to show that the algorithm itself may be freed from
>> >>> > language and dictionary specifics.
>> >>> >
>> >>> >
>> >>> > BTW. Your English is ok, don't be sorry about it.
>> >>> >
>> >>> > [1] http://kottke.org/06/10/google-code-search
>> >>> >
>> >>> >
>> >>> > 2009/4/8 Marija Šljivović <[email protected]>:
>> >>> >> HI!
>> >>> >>
>> >>> >> Firstly, I apologise for my English.
>> >>> >>
>> >>> >> I make a small research for already known algorithms for
>> >>> >> code-plagiarism detection There are several of them,
>> >>> >> but I am afraid that neither one of them cannot be used in our case
>> >>> >> because code for comparation is missing. I will search more about
>> >>> >> that.
>> >>> >>
>> >>> >> Lets talk about things we know now and which I can implement.
>> >>> >>
>> >>> >> The greatest problem in this tool which have to be solved is that if
>> >>> >> this tool is being run on large source-code, time for completely
>> >>> >> processing of it will be very long. Because of that we have to make
>> >>> >> logic which recognise "critical" code segments.
>> >>> >>  We will try to develope our own heuristic which can recognize which
>> >>> >> code is "good" for cut&paste, and then pass that code on examination
>> >>> >> (of course, we must have an brute-force (will check everything) mode
>> >>> >> in this tool).
>> >>> >>
>> >>> >> There are several ideas for heuristic which will recognise code which
>> >>> >> is "good for copy and paste". As I am suggested, which is good and
>> >>> >> easy to implement is: If we recognise a comment in source code which
>> >>> >> is "uncommon", we can check it on search engines. Identical non-common
>> >>> >> comment will be great indication of possible plagiarised code.
>> >>> >>  In addition, identical comment with even small amount of "identical"
>> >>> >> source code associated with them is the sign of cut&pasted code too.
>> >>> >>
>> >>> >> If we can somehow recognize "independent" code chunks (almost all
>> >>> >> variables are in this part of code and this part of code works only
>> >>> >> with this variables and returns or sets single result), like
>> >>> >> functions, this code will be sent to be checked because code with
>> >>> >> these properties is more often plagiarised.
>> >>> >>
>> >>> >> Instead of this we can do it in opposite way by removing code which is
>> >>> >> common - for example in Java we use setProperty() and getProperty()
>> >>> >> methods very often. If we exclude this code of our code for
>> >>> >> comparation, we can send all rest to be checked.
>> >>> >>
>> >>> >> Code "good" for copying is very complex code with little number of
>> >>> >> input variables (searching functions, sorting functions...)
>> >>> >>
>> >>> >> I think that just one thing is sure - there is no generic, universal
>> >>> >> algorithm with which we will get good results... Well-documented
>> >>> >> combination of rules that defines common copied code in combination
>> >>> >> with brute-force approach can get best results.
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> If anybody has any idea or suggestion about this tool, it will be
>> >>> >> great to inform us! We together can define which functionalities will
>> >>> >> have this plagiariem checker tool.
>> >>> >>
>> >>> >> My very basic prototype of tool is here 
>> >>> >> http://b0ss.on.neobee.net/MinuliRad/RAT/
>> >>> >>
>> >>> >> For the end...in my prototype of this tool I use one interface -
>> >>> >> ISearchEngine and all parsers for search engines must implement it. If
>> >>> >> we can provide Class name of current ISearchEngine implementation to
>> >>> >> main program (by argument for example), we can instantiate object of
>> >>> >> that class thought reflection. Why? In that way we can separate
>> >>> >> parsers from logic and parsers will be "plug ins" - I will make parser
>> >>> >> for krugle.org and place it in classpath of this tool, run tool with
>> >>> >> ClassName of parser plug in tool arguments and have my own
>> >>> >> implementation of krugle.org code search. Somebody else can provide
>> >>> >> another implementation..., all plugins will be simple .jar files in
>> >>> >> RAT classpath...
>> >>> >>
>> >>> >>
>> >>> >> Best regards,
>> >>> >>
>> >>> >> Marija Sljivovic
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> > With best regards / с наилучшими пожеланиями,
>> >>> > Alexei Fedotov / Алексей Федотов,
>> >>> > http://www.telecom-express.ru/
>> >>> > http://people.apache.org/~aaf/
>> >>> > http://harmony.apache.org/
>> >>> > http://code.google.com/p/openmeetings/
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> With best regards / с наилучшими пожеланиями,
>> >>> Alexei Fedotov / Алексей Федотов,
>> >>> http://www.telecom-express.ru/
>> >>> http://people.apache.org/~aaf/
>> >>> http://harmony.apache.org/
>> >>> http://code.google.com/p/openmeetings/
>> >>
>> >
>> >
>> >
>> > --
>> > With best regards / с наилучшими пожеланиями,
>> > Alexei Fedotov / Алексей Федотов,
>> > http://www.telecom-express.ru/
>> > http://people.apache.org/~aaf/
>> > http://harmony.apache.org/
>> > http://code.google.com/p/openmeetings/
>> >
>>
>>
>>
>> --
>> With best regards / с наилучшими пожеланиями,
>> Alexei Fedotov / Алексей Федотов,
>> http://www.telecom-express.ru/
>> http://people.apache.org/~aaf/
>> http://harmony.apache.org/
>> http://code.google.com/p/openmeetings/
>



-- 
With best regards / с наилучшими пожеланиями,
Alexei Fedotov / Алексей Федотов,
http://www.telecom-express.ru/
http://people.apache.org/~aaf/
http://harmony.apache.org/
http://code.google.com/p/openmeetings/

Re: [rat-cut&paste] engaging with search engine authors

Reply via email to