Re: near duplicates

karl wettin Tue, 17 Oct 2006 09:37:34 -0700


On 17 Oct 2006, at 17:54, Find Me wrote:

How to eliminate near duplicates from the index?

I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms in a document. Based on the output from that, one would try to find a threshold. Either way it will consume lots of CPU.
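
A minimal sketch of the distance-plus-threshold idea, in Python rather than against the Lucene API. It is only an illustration under stated assumptions: documents are plain strings split on whitespace, term positions are ignored, and the names term_vector, euclidean_distance and near_duplicates as well as the threshold value 0.3 are made up for this example.

```python
import math
from collections import Counter
from itertools import combinations


def term_vector(text):
    """Term frequencies of a whitespace-tokenised text, scaled to unit length."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()} if norm else {}


def euclidean_distance(a, b):
    """Euclidean distance between two sparse term vectors (dicts)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def near_duplicates(docs, threshold=0.3):
    """Return pairs of document ids whose distance is at or below the threshold.

    The threshold is an assumed value; it would have to be tuned on real data.
    """
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    pairs = []
    # Every pair is compared, so the cost grows quadratically with the
    # number of documents.
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        if euclidean_distance(vec_a, vec_b) <= threshold:
            pairs.append((id_a, id_b))
    return pairs
```

Comparing every pair is what makes this expensive: n documents need n(n-1)/2 distance computations, which matches the CPU concern above.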




