Chirag Chaman wrote:
They may in fact be two different URLs -- Unix/Linux treats them as
separate paths. For example, we use capitalization as a hashing mechanism.
So 90% of the time I don't think it will be a problem.

That being said, the content check should have flagged it for removal during
the merge, so you won't end up with duplicates even if the URLs are not the same.

Deduplication (as implemented in DeleteDuplicates and SegmentMergeTool) checks both the URL and the content MD5. The algorithm works like this:

* if two pages have the same url, keep the newer one.

* if two pages have the same content, keep the one with the shorter url.
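The two rules above can be sketched in plain Java. This is only an illustrative in-memory version, not the actual Nutch implementation (which runs over segment data); the Page record and its fields are assumptions for the sake of the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the two dedup rules; Page and its fields are
// hypothetical, not the real Nutch data structures.
public class DedupSketch {
    record Page(String url, String contentMd5, long fetchTime) {}

    static List<Page> dedup(List<Page> pages) {
        // Rule 1: if two pages have the same URL, keep the newer one.
        Map<String, Page> byUrl = new HashMap<>();
        for (Page p : pages) {
            byUrl.merge(p.url(), p,
                (a, b) -> a.fetchTime() >= b.fetchTime() ? a : b);
        }
        // Rule 2: if two pages have the same content hash,
        // keep the one with the shorter URL.
        Map<String, Page> byContent = new HashMap<>();
        for (Page p : byUrl.values()) {
            byContent.merge(p.contentMd5(), p,
                (a, b) -> a.url().length() <= b.url().length() ? a : b);
        }
        return new ArrayList<>(byContent.values());
    }

    public static void main(String[] args) {
        List<Page> result = dedup(List.of(
            new Page("http://a.com/x", "md5-1", 100),
            new Page("http://a.com/x", "md5-2", 200),          // newer fetch wins for same URL
            new Page("http://a.com/longer-url", "md5-2", 300)  // shorter URL wins for same content
        ));
        for (Page p : result) {
            System.out.println(p.url() + " " + p.contentMd5());
        }
    }
}
```

Note the ordering: the URL rule is applied first, so a page surviving rule 1 can still be dropped by rule 2 if another page elsewhere has identical content and a shorter URL.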


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
