Robert Burrell Donkin wrote:
> Marija Šljivović wrote:
>> Hi!
>> I am working on a copy & paste (plagiarism) detector.
>
> cool
>
>> You can see information about the project and reports of my progress at
>> these locations:
>> http://wiki.apache.org/general/MarijaSljivovic/SoC2009ApacheRatProposal
>> https://issues.apache.org/jira/browse/RAT-45
>> or get the source code and binary distributions at:
>> http://code.google.com/p/apache-rat-pd/
>> I am now thinking of writing some heuristic misspelling checkers. These
>> algorithms will be able to spot misspelled words in source code.
>> The affected part of the code will then be sent to a code search engine
>> (Google Code Search, for example) to check whether similar misspellings
>> can be found in public code bases.
>> That way we can check whether a piece of code may have been plagiarised.
>> I am now looking for an open source library that can be used for this
>> task. I found one, jazzy ( http://jazzy.sourceforge.net/ ), and I think
>> it is good for this purpose.
>
> probably best to make the API pluggable (jazzy is LGPL but this is good
> advice in any case)
>
>> Any suggestions for another solution that is better than jazzy?
>
> i'm not sure whether it would be better but an alternative approach
> would be to use a semi-structured text analysis tool for example UIMA
> (http://incubator.apache.org/uima/) or lucene

for lucene, start by looking at
http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/spellchecker/
and then create a custom dictionary by tokenising a large number of
source files

- robert
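
To make the "custom dictionary by tokenising source files" idea concrete, here is a minimal Java sketch. It is only an illustration under stated assumptions: it does not use the Lucene spellchecker or jazzy, the class and method names (SourceDictionarySketch, tokenise, buildDictionary, findMisspellings) are hypothetical, and plain Levenshtein distance stands in for whatever similarity measure a real spellchecker would apply. A word from the checked code is flagged when it is missing from the corpus dictionary but sits within a small edit distance of a word that is present.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from Apache RAT, Lucene or jazzy.
class SourceDictionarySketch {

    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    /** Splits source text into lowercase words, breaking camelCase identifiers apart. */
    static List<String> tokenise(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(source);
        while (m.find()) {
            // Split at each lower-to-upper boundary, e.g. "readLine" -> "read", "Line".
            for (String part : m.group().split("(?<=[a-z])(?=[A-Z])")) {
                tokens.add(part.toLowerCase());
            }
        }
        return tokens;
    }

    /** Counts how often each word occurs across a corpus of source files. */
    static Map<String, Integer> buildDictionary(List<String> sources) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String source : sources) {
            for (String token : tokenise(source)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Dynamic-programming Levenshtein distance (insert, delete, substitute). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /** Maps each unknown word in the candidate to its closest dictionary word, if close enough. */
    static Map<String, String> findMisspellings(String candidate,
                                                Map<String, Integer> dictionary,
                                                int maxDistance) {
        Map<String, String> suspects = new LinkedHashMap<>();
        for (String token : tokenise(candidate)) {
            // Skip very short tokens (too many near neighbours) and words the corpus knows.
            if (token.length() < 4 || dictionary.containsKey(token)) continue;
            String best = null;
            int bestDistance = maxDistance + 1;
            for (String known : dictionary.keySet()) {
                int d = editDistance(token, known);
                if (d < bestDistance) {
                    best = known;
                    bestDistance = d;
                }
            }
            if (best != null) suspects.put(token, best);
        }
        return suspects;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = buildDictionary(List.of(
                "int bufferLength = buffer.length;",
                "String message = reader.readLine();"));
        System.out.println(
                findMisspellings("int bufferLenght = recieve(mesage);", dictionary, 2));
    }
}
```

The flagged words (not the suggested corrections) are what would then be sent to a code search engine, since a shared misspelling is the interesting signal. A word with no near neighbour in the corpus is not flagged here, so the result depends heavily on how large the tokenised corpus is.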
