On 17 Oct 2006, at 17:54, Find Me wrote:
How to eliminate near duplicates from the index?
I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms in a document. Based on the output from that, one would then try to find a threshold. Either way it will consume lots of CPU.
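A rough sketch of the distance idea, in Python rather than against any real index API. Everything here is an assumption for illustration: the function names, the whitespace tokenizer, the use of length-normalized term frequencies only (term positions are ignored to keep it short), and the threshold value of 0.1, which would have to be tuned on real data. It compares every pair of documents, which is where the CPU cost comes from.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Return length-normalized term frequencies for a document (hypothetical helper)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty documents
    return {term: count / total for term, count in counts.items()}


def euclidean_distance(a, b):
    """Euclidean distance between two sparse term vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def near_duplicates(docs, threshold=0.1):
    """Return pairs of document ids whose distance is at or below the threshold.

    docs maps a document id to its text. The threshold is an assumed
    starting value, not a recommendation. Cost is O(n^2) comparisons.
    """
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        if euclidean_distance(vec_a, vec_b) <= threshold:
            pairs.append((id_a, id_b))
    return pairs
```

Called with something like near_duplicates({"a": text_a, "b": text_b}), it returns the id pairs that fall under the threshold; one would then keep one document from each pair and drop the rest from the index.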