Hey everyone. I have been trying to get a certain kind of duplicate deletion working, but I need a little help.
Here is my problem. After a web crawl, I end up with many documents from different sites that have similar titles, and I want to remove all but one document from each group of similar titles. For example, I could have a list of titles like this:

1) George the Monkey won the bowl
2) The bowl was won by George the Monkey
3) Bowl won by George the Monkey

The way I do things now, I generate a query like this:

+title:George +title:Monkey +title:Bowl +title:won +title:the

and then run a search, which pulls back the matching documents. My first, bad, way of deleting the dupes was to check for scores greater than some number and delete those documents. However, as my crawler (Nutch) kept generating new indexes and I kept merging them, the scores kept coming out weirdly different.

So I found this forum item on similarity:

http://www.nabble.com/Overriding-Similarity-tf2128934.html#a5875307

and wanted to know whether that is a good way of finding these duplicate title matches, or whether someone has a better idea for finding them. The titles are not going to be exact matches, but they will be fairly similar.

Thanks for your help,
Scott

--
View this message in context: http://www.nabble.com/similarity-and-delete-duplicates-tf3222213.html#a8949366
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
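To make the question concrete, here is a minimal sketch of one score-independent way to spot such near-duplicate titles: compare the sets of significant words with Jaccard similarity. This is not Lucene code and not the approach from the linked thread; it is plain Python for illustration, and the stopword list, the `threshold` value, and the function names are all assumptions to be tuned.

```python
import re

# Hypothetical stopword list; adjust it for your own titles.
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "to", "was"}


def title_tokens(title):
    """Lowercase the title, split on non-alphanumerics, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # two empty token sets are treated as identical
    return len(a & b) / len(a | b)


def find_duplicates(titles, threshold=0.8):
    """Return (kept, duplicates) as lists of indexes into `titles`.

    A title is a duplicate if its token set is at least `threshold`
    similar to the token set of a title that was already kept.
    The threshold of 0.8 is an assumed starting point, not a tested value.
    """
    kept = []        # (index, token set) of the titles we keep
    duplicates = []  # indexes of titles judged to be duplicates
    for i, title in enumerate(titles):
        tokens = title_tokens(title)
        if any(jaccard(tokens, seen) >= threshold for _, seen in kept):
            duplicates.append(i)
        else:
            kept.append((i, tokens))
    return [i for i, _ in kept], duplicates


titles = [
    "George the Monkey won the bowl",
    "The bowl was won by George the Monkey",
    "Bowl won by George the Monkey",
    "Weather report for Tuesday",
]
kept, duplicates = find_duplicates(titles)
```

Because the similarity here depends only on the two titles being compared, it should not drift when indexes are merged. The pairwise comparison is quadratic in the number of kept titles, so for a large crawl it would likely need to be restricted to the candidates that a title query returns.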