On 10/18/06, Isabel Drost <[EMAIL PROTECTED]> wrote:

Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that I
> could look at the TermVectors and do a comparison to remove the
> duplicates.

As an alternative you could also have a look at the paper "Detecting
Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark
Manasse, Marc Najork.


Another good reference would be Soumen Chakrabarti's book "Mining the Web:
Discovering Knowledge from Hypertext Data" (2003), in particular the section
on shingling and the elimination of near duplicates. I think that approach
works at the document level rather than at the term vector level, but it
might be useful for preventing near-duplicate documents from being indexed
altogether.
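
For illustration, here is a minimal sketch of the shingling idea (my own toy
example, not code from the book; it assumes plain whitespace tokenisation and
uses Jaccard overlap on word 4-grams):

import java.util.*;

/** Rough sketch of shingle-based near-duplicate detection.
 *  Assumes simple whitespace tokenisation; real code would reuse
 *  the same analyzer that is used for indexing. */
public class ShingleSketch {

    /** Build the set of word k-grams ("shingles") for a document. */
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + k <= words.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + k; j++) {
                if (j > i) sb.append(' ');
                sb.append(words[j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    /** Jaccard overlap of the two shingle sets; values close to 1.0
     *  indicate near duplicates. */
    static double resemblance(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> d1 = shingles("the quick brown fox jumps over the lazy dog", 4);
        Set<String> d2 = shingles("the quick brown fox jumped over the lazy dog", 4);
        // documents whose resemblance exceeds a chosen threshold (e.g. 0.9)
        // could be skipped before indexing
        System.out.println("resemblance: " + resemblance(d1, d2));
    }
}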

> One major problem with this is the structure of the document is
> no longer important. Are there any obvious pitfalls? For example: Document
> A being a subset of Document B but in no particular order.

I think this case is pretty unlikely. But I am not sure whether you can
detect near duplicates by only comparing term-document vectors. There might
be problems with documents with slightly changed words, words that were
replaced with synonyms...

However, if you want to keep some information on the word order, you might
consider comparing n-gram document vectors. That is, each dimension in the
vector does not only represent one word but a sequence of 2, 3, 4, 5...
words.



Would this involve something like a window of 2-5 words around a particular
term in a document?
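
Something like the following is what I have in mind (just a toy sketch of my
own; it assumes consecutive word n-grams rather than a sliding window around
each term, and compares the resulting count vectors by cosine similarity):

import java.util.*;

/** Sketch of comparing two documents through n-gram count vectors.
 *  Assumes whitespace tokenisation and consecutive word n-grams. */
public class NGramVectorSketch {

    /** Count every consecutive n-gram of length minN..maxN words. */
    static Map<String, Integer> nGramVector(String text, int minN, int maxN) {
        String[] w = text.toLowerCase().split("\\s+");
        Map<String, Integer> vec = new HashMap<String, Integer>();
        for (int n = minN; n <= maxN; n++) {
            for (int i = 0; i + n <= w.length; i++) {
                StringBuilder gram = new StringBuilder(w[i]);
                for (int j = i + 1; j < i + n; j++) gram.append(' ').append(w[j]);
                String key = gram.toString();
                Integer c = vec.get(key);
                vec.put(key, c == null ? 1 : c + 1);
            }
        }
        return vec;
    }

    /** Cosine similarity between two sparse count vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (Integer v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0.0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 =
            nGramVector("please remove near duplicate documents from the index", 2, 5);
        Map<String, Integer> d2 =
            nGramVector("please remove near duplicate docs from the index", 2, 5);
        System.out.println("similarity: " + cosine(d1, d2));
    }
}

Is that roughly what you meant, or did you have a sliding window around each
indexed term in mind?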

Cheers,
Isabel
