Rat folks, I've got one more question. Can you suggest anything from RAT scanning engine which can be reused in scanning sources for cut & paste?
Thanks! On Thu, Apr 9, 2009 at 8:37 PM, Alexei Fedotov <[email protected]> wrote: > Robert, all, > Do I understand correctly that no ACQ [3] analogues should be signed > by Marija to contribute to RAT? Is it true that having ICLA for > Harmony I can start committing her code as a separate RAT module > provided we obey to accepted conventions and Marija checked an ASF > checkbox for her code? > > [3] http://harmony.apache.org/auth_cont_quest.html > > > > > > 2009/4/9 Marija Šljivović <[email protected]>: >> Hi! >> >> Firstly thanks for ideas! >> >> When I read this mails I saw a great idea - all heuristic algorithms >> can return int value(not only boolean - good to be checked, not good >> to be checked). I have written pseudocode where I use this idea. In >> that way we don't need to make two algorithms - for bruteforce and for >> heuristic check. In addition we can set how aggressive will tool work >> by seting LIMIT. >> If we use booleans for checkByHueristic we can not be enough accurate >> when to use searchEngine. >> When we use integers and sum it like in bottom algorithm we can >> measure how it is important to check code in searchEngine. >> Benefits of this approach: >> We do not have to change core sliding window algorithm. >> We can later add more IHueristicChecker changing algorithm. >> We can choose which algorithms we will use in specific situations too. >> >> //all hueristic checkers must implements this interface >> interface IHueristicChecker{ >> /** >> * @param codeToBeChecked >> * @return int value from 1 to 100 >> */ >> int checkByHueristic(String codeToBeChecked); >> >> } >> >> /** >> * Limit is user-defined constants 1-100 and it is used to >> * measure how aggressively will be used search engines >> * if we put there 10 =little usage of internet (search engines) >> * if we put there 90 =almost bruteforce checking will be run >> */ >> int LIMIT; >> >> //we create rules which we wish to use >> //this rules can be easy configured by user >> //by parameters or... >> IHueristicChecker commentsHueristicChecker = new >> CommentsHueristicChecker(programingLanguage...); >> IHueristicChecker missspelingsHueristicChecker = new >> MisspelingsHueristicChecker(commonLanguagePhrases...); >> IHueristicChecker goodFunctionToBeCopiedHueristicChecker = new >> GoodFunctionToBeCopiedHueristicChecker(params...); >> >> //we add all rules we ar interested in >> List<IHueristicChecker> algorithmsForChecking = new >> ArrayList<IHueristicChecker>(); >> algorithmsForChecking.add(commentsHueristicChecker); >> algorithmsForChecking.add(missspelingsHueristicChecker); >> algorithmsForChecking.add(bruteForceChecker); >> >> >> ... >> //then in sliding window algorithm we use this checkers in part of >> algorihm where we >> //know current window content >> int sum = 0; >> for(IHueristicChecker checker : algorithmsForChecking) >> { >> sum += checker.checkByHueristic(codeToBeChecked); >> } >> >> if (sum/algorithmsForChecking.size() < LIMIT) >> { >> //we think that heuristic show us that this code must be checked >> searchEngine.checkCodeOnInternet(codeToBeChecked); >> >> } >> ... >> >> P.S. >> I updated my plans on http://b0ss.on.neobee.net/MinuliRad/RAT/ ... >> >> >> >> >> On Thu, Apr 9, 2009 at 8:21 AM, Alexei Fedotov <[email protected]> >> wrote: >>> >>> Folks, Marija, >>> >>> I was thinking about proper reading. I believe it worth approaching >>> code search engine authors and asking them for advise. I think it is >>> more appropriate to approach Google first because they are known to be >>> open source friendly and because we are somehow competing with Krugle >>> - AFAIK, they have a commercial anti-plagiarism tool. >>> >>> Thanks. >>> >>> BTW. As for algorithm example, the sliding window is language >>> dependent. Thus, sliding becomes a method of a language: >>> >>> while (language.slideToNextInterestingPiece(window)) { >>> String content = window.content(); >>> if (heuristic.unique(content) > LIMIT) { >>> >>> >>> >>> >>> >>> On Wed, Apr 8, 2009 at 10:17 PM, Alexei Fedotov >>> <[email protected]> wrote: >>> > Hello Marija, >>> > I'm impressed with your progress in learning things and prototyping a >>> > solution. As for your question about useful reading, there is a bunch >>> > of fun examples of Google Code Search usage here [1]. >>> > >>> > I agree that there is no one best algorithm to approach the problem. >>> > My suggestion on this is to make it modular and parameterizable. >>> > Instead of removing common code we can just slide a window through the >>> > code, e.g. do something like that: >>> > >>> > IHeuristic heuristic = new DictionaryHeuristic(commonWordProbabilities); >>> > ILanguage language = new CPlusPlusLanguage(); >>> > >>> > while (window.slide()) { >>> > String content = window.content(); >>> > if (heuristic.unique(content) + language.independent(content) > LIMIT) { >>> > ISearchResult result = >>> > searchEngine.search(language.replaceVarsWithWildcards(content), >>> > excludeUrls); >>> > if (result != null) { >>> > report.add(result); >>> > } >>> > } >>> > } >>> > >>> > Please, note that I'm not a fluent Java programmer, hence, you may >>> > discover a better way to express this algorithm. The main intention of >>> > this example is to show that the algorithm itself may be freed from >>> > language and dictionary specifics. >>> > >>> > >>> > BTW. Your English is ok, don't be sorry about it. >>> > >>> > [1] http://kottke.org/06/10/google-code-search >>> > >>> > >>> > 2009/4/8 Marija Šljivović <[email protected]>: >>> >> HI! >>> >> >>> >> Firstly, I apologise for my English. >>> >> >>> >> I make a small research for already known algorithms for >>> >> code-plagiarism detection There are several of them, >>> >> but I am afraid that neither one of them cannot be used in our case >>> >> because code for comparation is missing. I will search more about >>> >> that. >>> >> >>> >> Lets talk about things we know now and which I can implement. >>> >> >>> >> The greatest problem in this tool which have to be solved is that if >>> >> this tool is being run on large source-code, time for completely >>> >> processing of it will be very long. Because of that we have to make >>> >> logic which recognise "critical" code segments. >>> >> We will try to develope our own heuristic which can recognize which >>> >> code is "good" for cut&paste, and then pass that code on examination >>> >> (of course, we must have an brute-force (will check everything) mode >>> >> in this tool). >>> >> >>> >> There are several ideas for heuristic which will recognise code which >>> >> is "good for copy and paste". As I am suggested, which is good and >>> >> easy to implement is: If we recognise a comment in source code which >>> >> is "uncommon", we can check it on search engines. Identical non-common >>> >> comment will be great indication of possible plagiarised code. >>> >> In addition, identical comment with even small amount of "identical" >>> >> source code associated with them is the sign of cut&pasted code too. >>> >> >>> >> If we can somehow recognize "independent" code chunks (almost all >>> >> variables are in this part of code and this part of code works only >>> >> with this variables and returns or sets single result), like >>> >> functions, this code will be sent to be checked because code with >>> >> these properties is more often plagiarised. >>> >> >>> >> Instead of this we can do it in opposite way by removing code which is >>> >> common - for example in Java we use setProperty() and getProperty() >>> >> methods very often. If we exclude this code of our code for >>> >> comparation, we can send all rest to be checked. >>> >> >>> >> Code "good" for copying is very complex code with little number of >>> >> input variables (searching functions, sorting functions...) >>> >> >>> >> I think that just one thing is sure - there is no generic, universal >>> >> algorithm with which we will get good results... Well-documented >>> >> combination of rules that defines common copied code in combination >>> >> with brute-force approach can get best results. >>> >> >>> >> >>> >> >>> >> If anybody has any idea or suggestion about this tool, it will be >>> >> great to inform us! We together can define which functionalities will >>> >> have this plagiariem checker tool. >>> >> >>> >> My very basic prototype of tool is here >>> >> http://b0ss.on.neobee.net/MinuliRad/RAT/ >>> >> >>> >> For the end...in my prototype of this tool I use one interface - >>> >> ISearchEngine and all parsers for search engines must implement it. If >>> >> we can provide Class name of current ISearchEngine implementation to >>> >> main program (by argument for example), we can instantiate object of >>> >> that class thought reflection. Why? In that way we can separate >>> >> parsers from logic and parsers will be "plug ins" - I will make parser >>> >> for krugle.org and place it in classpath of this tool, run tool with >>> >> ClassName of parser plug in tool arguments and have my own >>> >> implementation of krugle.org code search. Somebody else can provide >>> >> another implementation..., all plugins will be simple .jar files in >>> >> RAT classpath... >>> >> >>> >> >>> >> Best regards, >>> >> >>> >> Marija Sljivovic >>> >> >>> > >>> > >>> > >>> > -- >>> > With best regards / с наилучшими пожеланиями, >>> > Alexei Fedotov / Алексей Федотов, >>> > http://www.telecom-express.ru/ >>> > http://people.apache.org/~aaf/ >>> > http://harmony.apache.org/ >>> > http://code.google.com/p/openmeetings/ >>> > >>> >>> >>> >>> -- >>> With best regards / с наилучшими пожеланиями, >>> Alexei Fedotov / Алексей Федотов, >>> http://www.telecom-express.ru/ >>> http://people.apache.org/~aaf/ >>> http://harmony.apache.org/ >>> http://code.google.com/p/openmeetings/ >> > > > > -- > With best regards / с наилучшими пожеланиями, > Alexei Fedotov / Алексей Федотов, > http://www.telecom-express.ru/ > http://people.apache.org/~aaf/ > http://harmony.apache.org/ > http://code.google.com/p/openmeetings/ > -- With best regards / с наилучшими пожеланиями, Alexei Fedotov / Алексей Федотов, http://www.telecom-express.ru/ http://people.apache.org/~aaf/ http://harmony.apache.org/ http://code.google.com/p/openmeetings/
