Marija, I like your plan, and I like that you get acquainted with Apache infrastructure. The next step is to submit your prototype to JIRA. When you sign an ASF grant using JIRA checkbox, I start reviewing and committing your code.
As for return type for heuristics, I don't think I'm 100% correct. Sorry for being contradictory. There are some metrics, for example, a number of pages shown by a search engine for a given phrase, which are naturally expressed in integers. This should not stop you from settling to a variant you like more and developing inter-component interfaces. I find this discussion useful, because we come to a definition of high level design components: languages, heuristics, configuration managers (e.g. command line or property parsers) and search engine wrappers. What else have I missed in this list considering a minimal set of modules? Best regards! 2009/4/14 Marija Šljivović <[email protected]>: > Hi! > > I made a page about this project on Apache Wiki: > http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal > I plan to post there everything about this project. > > About return results type which will be returned from each heuristic > algorithm: > We can use integers(1-100 for example) to measure how each algorithm > is sure about need of checking code on searchEngine, or we can use > boolean as returned type. > If we use integers we can easy define how aggressive will be used > searchEngine-s(by program parameter for example). But natural place of > this integer value for defining if this tool will use more or less > searchEngine resources is in specific heuristicAlgorithm > implementation. We can only set it from outside. > So, I think that Alexei is correct. Booleans will be better choice if > we move some logic in IHueristicChecker Implementations. > > All the best! > Marija > > > > > On Thu, Apr 9, 2009 at 6:48 PM, Alexei Fedotov <[email protected]> > wrote: >> >> Rat folks, >> I've got one more question. Can you suggest anything from RAT scanning >> engine which can be reused in scanning sources for cut & paste? >> >> Thanks! >> >> >> On Thu, Apr 9, 2009 at 8:37 PM, Alexei Fedotov <[email protected]> >> wrote: >> > Robert, all, >> > Do I understand correctly that no ACQ [3] analogues should be signed >> > by Marija to contribute to RAT? Is it true that having ICLA for >> > Harmony I can start committing her code as a separate RAT module >> > provided we obey to accepted conventions and Marija checked an ASF >> > checkbox for her code? >> > >> > [3] http://harmony.apache.org/auth_cont_quest.html >> > >> > >> > >> > >> > >> > 2009/4/9 Marija Šljivović <[email protected]>: >> >> Hi! >> >> >> >> Firstly thanks for ideas! >> >> >> >> When I read this mails I saw a great idea - all heuristic algorithms >> >> can return int value(not only boolean - good to be checked, not good >> >> to be checked). I have written pseudocode where I use this idea. In >> >> that way we don't need to make two algorithms - for bruteforce and for >> >> heuristic check. In addition we can set how aggressive will tool work >> >> by seting LIMIT. >> >> If we use booleans for checkByHueristic we can not be enough accurate >> >> when to use searchEngine. >> >> When we use integers and sum it like in bottom algorithm we can >> >> measure how it is important to check code in searchEngine. >> >> Benefits of this approach: >> >> We do not have to change core sliding window algorithm. >> >> We can later add more IHueristicChecker changing algorithm. >> >> We can choose which algorithms we will use in specific situations too. >> >> >> >> //all hueristic checkers must implements this interface >> >> interface IHueristicChecker{ >> >> /** >> >> * @param codeToBeChecked >> >> * @return int value from 1 to 100 >> >> */ >> >> int checkByHueristic(String codeToBeChecked); >> >> >> >> } >> >> >> >> /** >> >> * Limit is user-defined constants 1-100 and it is used to >> >> * measure how aggressively will be used search engines >> >> * if we put there 10 =little usage of internet (search engines) >> >> * if we put there 90 =almost bruteforce checking will be run >> >> */ >> >> int LIMIT; >> >> >> >> //we create rules which we wish to use >> >> //this rules can be easy configured by user >> >> //by parameters or... >> >> IHueristicChecker commentsHueristicChecker = new >> >> CommentsHueristicChecker(programingLanguage...); >> >> IHueristicChecker missspelingsHueristicChecker = new >> >> MisspelingsHueristicChecker(commonLanguagePhrases...); >> >> IHueristicChecker goodFunctionToBeCopiedHueristicChecker = new >> >> GoodFunctionToBeCopiedHueristicChecker(params...); >> >> >> >> //we add all rules we ar interested in >> >> List<IHueristicChecker> algorithmsForChecking = new >> >> ArrayList<IHueristicChecker>(); >> >> algorithmsForChecking.add(commentsHueristicChecker); >> >> algorithmsForChecking.add(missspelingsHueristicChecker); >> >> algorithmsForChecking.add(bruteForceChecker); >> >> >> >> >> >> ... >> >> //then in sliding window algorithm we use this checkers in part of >> >> algorihm where we >> >> //know current window content >> >> int sum = 0; >> >> for(IHueristicChecker checker : algorithmsForChecking) >> >> { >> >> sum += checker.checkByHueristic(codeToBeChecked); >> >> } >> >> >> >> if (sum/algorithmsForChecking.size() < LIMIT) >> >> { >> >> //we think that heuristic show us that this code must be checked >> >> searchEngine.checkCodeOnInternet(codeToBeChecked); >> >> >> >> } >> >> ... >> >> >> >> P.S. >> >> I updated my plans on http://b0ss.on.neobee.net/MinuliRad/RAT/ ... >> >> >> >> >> >> >> >> >> >> On Thu, Apr 9, 2009 at 8:21 AM, Alexei Fedotov <[email protected]> >> >> wrote: >> >>> >> >>> Folks, Marija, >> >>> >> >>> I was thinking about proper reading. I believe it worth approaching >> >>> code search engine authors and asking them for advise. I think it is >> >>> more appropriate to approach Google first because they are known to be >> >>> open source friendly and because we are somehow competing with Krugle >> >>> - AFAIK, they have a commercial anti-plagiarism tool. >> >>> >> >>> Thanks. >> >>> >> >>> BTW. As for algorithm example, the sliding window is language >> >>> dependent. Thus, sliding becomes a method of a language: >> >>> >> >>> while (language.slideToNextInterestingPiece(window)) { >> >>> String content = window.content(); >> >>> if (heuristic.unique(content) > LIMIT) { >> >>> >> >>> >> >>> >> >>> >> >>> >> >>> On Wed, Apr 8, 2009 at 10:17 PM, Alexei Fedotov >> >>> <[email protected]> wrote: >> >>> > Hello Marija, >> >>> > I'm impressed with your progress in learning things and prototyping a >> >>> > solution. As for your question about useful reading, there is a bunch >> >>> > of fun examples of Google Code Search usage here [1]. >> >>> > >> >>> > I agree that there is no one best algorithm to approach the problem. >> >>> > My suggestion on this is to make it modular and parameterizable. >> >>> > Instead of removing common code we can just slide a window through the >> >>> > code, e.g. do something like that: >> >>> > >> >>> > IHeuristic heuristic = new >> >>> > DictionaryHeuristic(commonWordProbabilities); >> >>> > ILanguage language = new CPlusPlusLanguage(); >> >>> > >> >>> > while (window.slide()) { >> >>> > String content = window.content(); >> >>> > if (heuristic.unique(content) + language.independent(content) > >> >>> > LIMIT) { >> >>> > ISearchResult result = >> >>> > searchEngine.search(language.replaceVarsWithWildcards(content), >> >>> > excludeUrls); >> >>> > if (result != null) { >> >>> > report.add(result); >> >>> > } >> >>> > } >> >>> > } >> >>> > >> >>> > Please, note that I'm not a fluent Java programmer, hence, you may >> >>> > discover a better way to express this algorithm. The main intention of >> >>> > this example is to show that the algorithm itself may be freed from >> >>> > language and dictionary specifics. >> >>> > >> >>> > >> >>> > BTW. Your English is ok, don't be sorry about it. >> >>> > >> >>> > [1] http://kottke.org/06/10/google-code-search >> >>> > >> >>> > >> >>> > 2009/4/8 Marija Šljivović <[email protected]>: >> >>> >> HI! >> >>> >> >> >>> >> Firstly, I apologise for my English. >> >>> >> >> >>> >> I make a small research for already known algorithms for >> >>> >> code-plagiarism detection There are several of them, >> >>> >> but I am afraid that neither one of them cannot be used in our case >> >>> >> because code for comparation is missing. I will search more about >> >>> >> that. >> >>> >> >> >>> >> Lets talk about things we know now and which I can implement. >> >>> >> >> >>> >> The greatest problem in this tool which have to be solved is that if >> >>> >> this tool is being run on large source-code, time for completely >> >>> >> processing of it will be very long. Because of that we have to make >> >>> >> logic which recognise "critical" code segments. >> >>> >> We will try to develope our own heuristic which can recognize which >> >>> >> code is "good" for cut&paste, and then pass that code on examination >> >>> >> (of course, we must have an brute-force (will check everything) mode >> >>> >> in this tool). >> >>> >> >> >>> >> There are several ideas for heuristic which will recognise code which >> >>> >> is "good for copy and paste". As I am suggested, which is good and >> >>> >> easy to implement is: If we recognise a comment in source code which >> >>> >> is "uncommon", we can check it on search engines. Identical non-common >> >>> >> comment will be great indication of possible plagiarised code. >> >>> >> In addition, identical comment with even small amount of "identical" >> >>> >> source code associated with them is the sign of cut&pasted code too. >> >>> >> >> >>> >> If we can somehow recognize "independent" code chunks (almost all >> >>> >> variables are in this part of code and this part of code works only >> >>> >> with this variables and returns or sets single result), like >> >>> >> functions, this code will be sent to be checked because code with >> >>> >> these properties is more often plagiarised. >> >>> >> >> >>> >> Instead of this we can do it in opposite way by removing code which is >> >>> >> common - for example in Java we use setProperty() and getProperty() >> >>> >> methods very often. If we exclude this code of our code for >> >>> >> comparation, we can send all rest to be checked. >> >>> >> >> >>> >> Code "good" for copying is very complex code with little number of >> >>> >> input variables (searching functions, sorting functions...) >> >>> >> >> >>> >> I think that just one thing is sure - there is no generic, universal >> >>> >> algorithm with which we will get good results... Well-documented >> >>> >> combination of rules that defines common copied code in combination >> >>> >> with brute-force approach can get best results. >> >>> >> >> >>> >> >> >>> >> >> >>> >> If anybody has any idea or suggestion about this tool, it will be >> >>> >> great to inform us! We together can define which functionalities will >> >>> >> have this plagiariem checker tool. >> >>> >> >> >>> >> My very basic prototype of tool is here >> >>> >> http://b0ss.on.neobee.net/MinuliRad/RAT/ >> >>> >> >> >>> >> For the end...in my prototype of this tool I use one interface - >> >>> >> ISearchEngine and all parsers for search engines must implement it. If >> >>> >> we can provide Class name of current ISearchEngine implementation to >> >>> >> main program (by argument for example), we can instantiate object of >> >>> >> that class thought reflection. Why? In that way we can separate >> >>> >> parsers from logic and parsers will be "plug ins" - I will make parser >> >>> >> for krugle.org and place it in classpath of this tool, run tool with >> >>> >> ClassName of parser plug in tool arguments and have my own >> >>> >> implementation of krugle.org code search. Somebody else can provide >> >>> >> another implementation..., all plugins will be simple .jar files in >> >>> >> RAT classpath... >> >>> >> >> >>> >> >> >>> >> Best regards, >> >>> >> >> >>> >> Marija Sljivovic >> >>> >> >> >>> > >> >>> > >> >>> > >> >>> > -- >> >>> > With best regards / с наилучшими пожеланиями, >> >>> > Alexei Fedotov / Алексей Федотов, >> >>> > http://www.telecom-express.ru/ >> >>> > http://people.apache.org/~aaf/ >> >>> > http://harmony.apache.org/ >> >>> > http://code.google.com/p/openmeetings/ >> >>> > >> >>> >> >>> >> >>> >> >>> -- >> >>> With best regards / с наилучшими пожеланиями, >> >>> Alexei Fedotov / Алексей Федотов, >> >>> http://www.telecom-express.ru/ >> >>> http://people.apache.org/~aaf/ >> >>> http://harmony.apache.org/ >> >>> http://code.google.com/p/openmeetings/ >> >> >> > >> > >> > >> > -- >> > With best regards / с наилучшими пожеланиями, >> > Alexei Fedotov / Алексей Федотов, >> > http://www.telecom-express.ru/ >> > http://people.apache.org/~aaf/ >> > http://harmony.apache.org/ >> > http://code.google.com/p/openmeetings/ >> > >> >> >> >> -- >> With best regards / с наилучшими пожеланиями, >> Alexei Fedotov / Алексей Федотов, >> http://www.telecom-express.ru/ >> http://people.apache.org/~aaf/ >> http://harmony.apache.org/ >> http://code.google.com/p/openmeetings/ > -- With best regards / с наилучшими пожеланиями, Alexei Fedotov / Алексей Федотов, http://www.telecom-express.ru/ http://people.apache.org/~aaf/ http://harmony.apache.org/ http://code.google.com/p/openmeetings/
