Ah, it must have been the content check that was case sensitive.
This link was on both pages, but one had a capital letter C, the other a
lowercase c:
<a
href="/general/travel/staff1.nsf/$lookup/Contents?OpenDocument&Start=1&
<a
href="/general/travel/staff1.nsf/$lookup/contents?OpenDocument&Start=1&
If you're using md5sum, presumably it will be case sensitive here. It'd be
good if the dedup could handle that though.
Thanks for the info below by the way.
JS.
Chirag Chaman wrote:
They may in fact be two different URLs -- Unix/Linux would treat them as
separate paths. For example, we use the capitalization as a hashing mechanism.
Thus, 90% of the time I don't think it will be a problem.
That being said, the content check should have flagged it for removal during
the merge, so you won't end up with dups even if the URLs are not the same.
Deduplication (as implemented in DeleteDuplicates and SegmentMergeTool)
checks both the URL and the content md5. The algorithm works like this:
* if two pages have the same URL, keep the newer one.
* if two pages have the same content, keep the one with the shorter URL.
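
The two rules above can be sketched roughly as follows. This is only a minimal Python illustration, not the actual DeleteDuplicates or SegmentMergeTool code. The Page record, its fetch_time field, the dedup function, and the tie-breaking (the first page seen wins when fetch times or URL lengths are equal) are assumptions made for the example:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    content: bytes
    fetch_time: int  # hypothetical timestamp; a larger value means newer


def dedup(pages):
    """Apply the two rules in order and return the surviving pages."""
    # Rule 1: if two pages have the same URL, keep the newer one.
    newest_by_url = {}
    for page in pages:
        kept = newest_by_url.get(page.url)
        if kept is None or page.fetch_time > kept.fetch_time:
            newest_by_url[page.url] = page

    # Rule 2: if two pages have the same content MD5,
    # keep the one with the shorter URL.
    shortest_by_hash = {}
    for page in newest_by_url.values():
        digest = hashlib.md5(page.content).hexdigest()
        kept = shortest_by_hash.get(digest)
        if kept is None or len(page.url) < len(kept.url):
            shortest_by_hash[digest] = page

    return list(shortest_by_hash.values())
```

Note that in this sketch the content comparison is an exact byte-for-byte MD5, so two pages that differ only in the case of one link would both survive.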
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com