Find Me wrote:
> How to eliminate near duplicates from the index? Someone suggested that I
> could look at the TermVectors and do a comparison to remove the
> duplicates.

As an alternative, you could also have a look at the paper "Detecting 
Phrase-Level Duplication on the World Wide Web" by Dennis Fetterly, Mark 
Manasse, and Marc Najork. 


> One major problem with this is that the structure of the document is 
> no longer taken into account. Are there any obvious pitfalls? For example, 
> Document A being a subset of Document B but in no particular order.

I think this case is pretty unlikely in practice. But I am not sure whether 
you can detect near duplicates by comparing term-document vectors alone. There 
might be problems with documents in which a few words were slightly changed or 
replaced with synonyms...
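
For illustration, a minimal sketch of such a comparison in Java. It assumes 
you have already read the term frequencies of two documents out of their 
TermVectors into plain maps; the class and method names are made up for this 
example:

import java.util.Map;

public class TermVectorSimilarity {

    /** Cosine similarity between two term-frequency vectors. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        // Only terms occurring in both documents contribute to the dot product.
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer f = b.get(e.getKey());
            if (f != null) {
                dot += e.getValue() * (double) f;
            }
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }
}

Document pairs scoring above some threshold (say 0.95) would then be treated 
as near duplicates. But as said above, this completely ignores word order.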

However, if you want to keep some information about the word order, you might 
consider comparing n-gram document vectors instead. That is, each dimension of 
the vector represents not a single word but a sequence of 2, 3, 4, 5, ... 
consecutive words.
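
Sketched in Java (untested, the names are my own), this would mean building 
the set of n-word shingles per document and comparing two documents by their 
Jaccard overlap, which is close in spirit to the shingling approach in the 
paper mentioned above:

import java.util.HashSet;
import java.util.Set;

public class ShingleSimilarity {

    /** Builds the set of all n-word shingles from a tokenized document. */
    public static Set<String> shingles(String[] tokens, int n) {
        Set<String> result = new HashSet<String>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) sb.append(' ');
                sb.append(tokens[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    /** Jaccard overlap |A intersect B| / |A union B| of two shingle sets. */
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<String>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}

Larger n makes the comparison stricter about word order; n between 3 and 5 is 
a common starting point.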

Cheers,
Isabel

-- 
QOTD: Knucklehead: "Knock, knock" Pee Wee: "Who's there?" Knucklehead: "Little 
ol' lady." Pee Wee: "Liddle ol' lady who?" Knucklehead: "I didn't know you 
could yodel" 
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_   VoIP:    <sip://[EMAIL PROTECTED]>
 |,4-  ) )-,_..;\ (  `'-'  Jabber: <xmpp://[EMAIL PROTECTED]>
'---''(_/--'  `-'\_) (fL)  Kein ToFu:  <http://learn.to/quote>
