Re: unable to remove duplicates

J S Mon, 13 Jun 2005 10:13:42 -0700

Ah, it must have been the content check that was case sensitive.

This link was on both pages, but one had a capital letter C, the other a lowercase c:

<a href="/general/travel/staff1.nsf/$lookup/Contents?OpenDocument&Start=1&
<a href="/general/travel/staff1.nsf/$lookup/contents?OpenDocument&Start=1&

If you're using md5sum, presumably it will be case sensitive here. It'd be good if the dedup could handle that though.
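
To illustrate the point, here is a minimal Python sketch (not Nutch code) showing that an MD5 digest changes when a single letter changes case, and one possible way a dedup step could ignore that. The two page snippets and the helper names content_md5 and content_md5_ignoring_case are made up for the example; lowercasing the whole page before hashing is only an assumption about how this could be handled, and it would also merge pages that differ only in the case of their visible text.

```python
import hashlib


def content_md5(text: str) -> str:
    """Return the hex MD5 digest of the page text (case sensitive)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def content_md5_ignoring_case(text: str) -> str:
    """Hypothetical variant: lowercase the text before hashing."""
    return content_md5(text.lower())


# Illustrative snippets that differ only in the case of one letter.
page_a = '<a href="/lookup/Contents?OpenDocument">Contents</a>'
page_b = '<a href="/lookup/contents?OpenDocument">Contents</a>'

# The plain digests differ, so a content check sees two distinct pages.
print(content_md5(page_a) == content_md5(page_b))

# After lowercasing, both pages hash to the same value.
print(content_md5_ignoring_case(page_a) == content_md5_ignoring_case(page_b))
```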

Thanks for the info below by the way.

JS.


Chirag Chaman wrote:

They may in fact be two different URLs -- Unix/Linux would treat them as
separate paths. For example, we use the capitalization as a hashing mechanism.
Thus, 90% of the time I don't think it will be a problem.

That being said, the content check should have flagged it for removal during
the merge, so you won't end up with dups even if the URLs are not the same.

Deduplication (as implemented in DeleteDuplicates and SegmentMergeTool) checks both the URL and the content md5. The algorithm works like this:


* if two pages have the same url, keep the newer one.

* if two pages have the same content, keep the one with the shorter url.
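
The two rules above can be sketched roughly as follows. This is a standalone Python illustration, not the actual DeleteDuplicates or SegmentMergeTool code: the Page record, its fetch_time field, and the deduplicate function are hypothetical names, and running the URL pass before the content pass is an assumption. When two URLs with identical content have the same length (as with Contents vs. contents), this sketch simply keeps whichever it saw first.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    content: bytes
    fetch_time: int  # hypothetical field: larger means fetched more recently


def content_md5(page: Page) -> str:
    """Hex MD5 digest of the raw page content."""
    return hashlib.md5(page.content).hexdigest()


def deduplicate(pages):
    # Rule 1: if two pages have the same URL, keep the newer one.
    newest_by_url = {}
    for page in pages:
        kept = newest_by_url.get(page.url)
        if kept is None or page.fetch_time > kept.fetch_time:
            newest_by_url[page.url] = page

    # Rule 2: if two pages have the same content, keep the shorter URL.
    shortest_by_digest = {}
    for page in newest_by_url.values():
        digest = content_md5(page)
        kept = shortest_by_digest.get(digest)
        if kept is None or len(page.url) < len(kept.url):
            shortest_by_digest[digest] = page

    return list(shortest_by_digest.values())


pages = [
    Page("http://a/x", b"old", 1),
    Page("http://a/x", b"new", 2),          # same URL, newer fetch
    Page("http://a/long/path", b"new", 3),  # same content, longer URL
]
for page in deduplicate(pages):
    print(page.url, page.fetch_time)
```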


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
