On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:
> Find Me wrote:
> > How to eliminate near duplicates from the index? Someone suggested that I
> > could look at the TermVectors and do a comparison to remove the duplicates.
>
> As an alternative you could also have a look at the paper "Detecting
> Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
> Manasse, and Marc Najork.
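Just to make the term vector idea from the original question concrete, a plain comparison of the stored vectors could look roughly like the sketch below. This is only an illustration and not tested; it assumes the field was indexed with term vectors enabled (otherwise getTermFreqVector() returns null, which is not handled here), and the field name in the usage comment is made up.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

/**
 * Sketch only: cosine similarity between the stored term vectors of two
 * documents.  Assumes the field was indexed with term vectors enabled.
 */
public class TermVectorCompare {

    // e.g. double sim = TermVectorCompare.cosine(reader, 3, 7, "contents");
    public static double cosine(IndexReader reader, int docA, int docB, String field)
            throws IOException {
        Map<String, Integer> a = toMap(reader.getTermFreqVector(docA, field));
        Map<String, Integer> b = toMap(reader.getTermFreqVector(docB, field));

        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int fa = e.getValue();
            normA += (double) fa * fa;
            Integer fb = b.get(e.getKey());
            if (fb != null) {
                dot += (double) fa * fb;
            }
        }
        for (int fb : b.values()) {
            normB += (double) fb * fb;
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0.0 : dot / denom;
    }

    // Turn a term frequency vector into a term -> frequency map.
    private static Map<String, Integer> toMap(TermFreqVector tv) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        String[] terms = tv.getTerms();
        int[] freqs = tv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            map.put(terms[i], freqs[i]);
        }
        return map;
    }
}

A threshold on the cosine value would then decide what counts as a duplicate, but of course this throws away all word order, which is exactly the concern raised further down.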
Another good reference would be Soumen Chakrabarti's book "Mining the Web: Discovering Knowledge from Hypertext Data" (2003), in particular the section on shingling and the elimination of near duplicates. Of course, this works at the document level rather than at the term vector level, but it might be useful to prevent near-duplicate documents from being indexed altogether.
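In case it helps, the core of the shingling idea boils down to something like this (only a rough sketch; the whitespace tokenization, the shingle size of 4 and the 0.9 threshold in the comment are arbitrary choices of mine, not taken from the book):

import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of w-shingling: a document is reduced to the set of all w-word
 * windows ("shingles"), and two documents count as near duplicates when the
 * Jaccard coefficient of their shingle sets exceeds some threshold.
 */
public class ShingleCheck {

    // All windows of w consecutive words, joined back into strings.
    public static Set<String> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + w <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + w; j++) {
                if (j > i) sb.append(' ');
                sb.append(words[j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    // |A intersect B| / |A union B|
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // e.g. skip indexing a new document when
    // jaccard(shingles(newDoc, 4), shingles(existingDoc, 4)) > 0.9
}

The usual way to make this scale is to fingerprint the shingles (min-hashing) instead of comparing the full sets against every existing document, but that is beyond a quick sketch.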
> > One major problem with this is that the structure of the document is no
> > longer important. Are there any obvious pitfalls? For example: Document A
> > being a subset of Document B, but in no particular order.
>
> I think this case is pretty unlikely. But I am not sure whether you can
> detect near duplicates by only comparing term-document vectors. There might
> be problems with documents with slightly changed words, words that were
> replaced with synonyms... However, if you want to keep some information on
> the word order, you might consider comparing n-gram document vectors. That
> is, each dimension in the vector does not represent only one word but a
> sequence of 2, 3, 4, 5... words.
>
> Cheers,
> Isabel

Would this involve something like a window of 2-5 words around a particular term in a document?
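To check that I understood the suggestion correctly, something like the sketch below is what I was picturing: every sequence of 2 to 5 consecutive words becomes one dimension of the vector, and two documents are compared by the cosine of their vectors. Only a sketch again; the whitespace tokenization and the exact n range are just placeholders.

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of "n-gram document vectors" as I understood them: every sequence
 * of n consecutive words (n = 2..5 here) is one dimension, its frequency is
 * the value, and documents are compared by the cosine of their vectors.
 */
public class NGramVectors {

    public static Map<String, Integer> vector(String text, int minN, int maxN) {
        String[] words = text.toLowerCase().split("\\s+");
        Map<String, Integer> vec = new HashMap<String, Integer>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder gram = new StringBuilder();
                for (int j = i; j < i + n; j++) {
                    if (j > i) gram.append(' ');
                    gram.append(words[j]);
                }
                String key = gram.toString();
                Integer old = vec.get(key);
                vec.put(key, old == null ? 1 : old + 1);
            }
        }
        return vec;
    }

    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer fb = b.get(e.getKey());
            if (fb != null) dot += (double) e.getValue() * fb;
        }
        for (int f : b.values()) normB += (double) f * f;
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0.0 : dot / denom;
    }

    // e.g. cosine(vector(textA, 2, 5), vector(textB, 2, 5)) close to 1.0
    // would flag textA and textB as near duplicates.
}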