Based on mail from Doug, I wrote a "more like this" query generator, named, well, MoreLikeThis. Bruce Ritchie and Mark Harwood made changes to it (especially term vector support) and contributed bug fixes. Thanks to everyone.
I've checked in the code to the sandbox under contributions/similarity.
The package it ends up at is org.apache.lucene.search.similar -- hope that makes sense.
I also created a class, SimilarityQueries, to hold other methods of similarity query generation. The 2 methods in there are "dumber" variations that use the entire source of the target doc to form a large query.
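To give a rough idea of what "use the entire source to form a large query" means, here is a minimal sketch in plain Java. It is not the checked-in code and does not use the Lucene API: the class name DumbSimilarityQuery, the helpers distinctTerms and buildQuery, the field name "body" and the crude tokenizer are all made up for illustration, and a real version would presumably run the text through an Analyzer and build Query objects instead of a string.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: hypothetical names, no Lucene dependency.
class DumbSimilarityQuery {

    // Crude stand-in for an analyzer: lowercase, split on anything
    // that is not an ASCII letter or digit, keep each term once.
    static Set<String> distinctTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) {
                terms.add(tok);
            }
        }
        return terms;
    }

    // OR together every distinct term of the source text against one field.
    static String buildQuery(String field, String text) {
        StringBuilder sb = new StringBuilder();
        for (String term : distinctTerms(text)) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: body:kasparov OR body:plays OR body:chess OR body:is OR body:hard
        System.out.println(buildQuery("body", "Kasparov plays chess. Chess is hard."));
    }
}
```

Note that nothing is filtered out here: every distinct term in the doc ends up in the query, so for a long doc the query gets very large.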
Javadoc is here:
http://www.searchmorph.com/pub/jakarta-lucene-sandbox/contributions/similarity/build/docs/api/org/apache/lucene/search/similar/package-summary.html
Online demo: the pages below compare the 3 variations on detecting similar docs. The timing info (the 3 numbers marked "(ms)") may be suspect. Also note that if you scroll to the bottom you can see the queries that were generated.
Here's a page showing docs similar to the entry for Iraq:
http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Iraq
And here's one for docs similar to the one on Garry Kasparov (he knows how to play chess :) ):
http://www.searchmorph.com/kat/wikipedia-compare.jsp?s=Garry_Kasparov
To get to it you start here:
http://www.searchmorph.com/kat/wikipedia.jsp
Then search for something and, on the search results page, follow a "cmp" link, e.g.:
http://www.searchmorph.com/kat/wikipedia.jsp?s=iraq
Make sense? Useful? Has anyone done any other variations (e.g. cosine measure)?
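By "cosine measure" I mean something along these lines: treat each doc as a vector of raw term counts and compare the angle between the two vectors. This is only a sketch under my own assumptions (hypothetical class CosineSketch, whitespace/punctuation tokenizing, no idf weighting, no index involved), not anything that exists in the sandbox.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative only: hypothetical names, no Lucene dependency.
class CosineSketch {

    // Count how often each lowercased term occurs in the text.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) {
                freqs.merge(tok, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Cosine of the angle between two term-count vectors:
    // dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFreqs("Kasparov plays chess");
        Map<String, Integer> d2 = termFreqs("Karpov plays chess");
        System.out.println(cosine(d1, d2));
    }
}
```

Unlike the query-based variations, this compares two docs directly, so it would need the term counts of every candidate doc rather than a single search.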
- Dave