Dear Developers!
I don't know which person to write to, so I am writing to this mailing list.
For my diploma thesis (available in German), I have written a similarity search that, for a given document (the query), returns documents whose content is similar to the query document to a varying degree. With this functionality one can find, for example, different versions of a document, plagiarized copies of a publication, or related articles in the archive of a scientific magazine.
The documents were indexed with Lucene 1.4 and are represented as term vectors inside the Lucene index. For searching, a real vector space retrieval model (not an advanced Boolean model) based on Gerard Salton's SMART retrieval system was implemented, including tf-idf weighting and the cosine similarity function. The whole search space is explored; no heuristic methods are used at present, but they can be retrofitted.
I have published a shortened version of the diploma prototype, which includes a GUI and one sample document collection (CIA Factbook), but not the sources of the project:
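To illustrate the idea outside of Lucene, here is a minimal Python sketch of tf-idf weighting with cosine similarity and an exhaustive scan over all documents. It is not the code from the prototype: the function names (tf_idf_vectors, cosine, rank) are made up for this example, and the weighting shown (raw term frequency times log(N/df)) is only one common variant, not necessarily the SMART variant used in the thesis.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build one sparse {term: weight} vector per document.

    docs is a list of token lists. Weight = tf * log(N / df),
    an assumed variant chosen for illustration only.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_index, vectors):
    """Compare the query document against every other document (no heuristics)."""
    q = vectors[query_index]
    scores = [(i, cosine(q, v)) for i, v in enumerate(vectors) if i != query_index]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

Calling rank(0, tf_idf_vectors(docs)) returns the other documents ordered by decreasing similarity to document 0. Because every document is scored, the cost grows linearly with the collection size, which is where heuristic pruning could later be added.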
http://www.informatik.htw-dresden.de/~s4328/pub/diploma_Marcel_Hofmann.zip
The prototype can be started with prototype/deploy/diploma.bat (sorry to all non-Windows users). The included readme.txt lists the original content of the prototype, not that of the shortened version.
I would like to contribute a library that contains the core of the implementation (vector space, cosine similarity, ...) to the Lucene project, but I need some hints on how to do this.
Greetings from Saxony, Germany
Marcel Hofmann
[EMAIL PROTECTED]