Lucifersam wrote:
Finally - I seem to have a problem with identical pages with different URLs,
i.e.

http://website/
http://website/default.htm

I was under the impression that these would be removed by the dedup process,
but this does not seem to be working. Is there something I'm missing?

Most likely the pages are slightly different - you can save them to files, and then run a diff utility to check for differences.
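
As a rough illustration of that check, here is a minimal Python sketch (not part of Nutch). It assumes the two pages have already been saved and loaded as strings; the sample page contents and the helper names content_digest and show_differences are made up for this example. It also assumes that deduplication compares some digest of the page content, which is why the MD5 comparison is included: if the digests differ, the pages will not be treated as duplicates.

```python
import difflib
import hashlib


def content_digest(text):
    """Return an MD5 hex digest of the page text (illustrative stand-in for a content signature)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def show_differences(a, b, name_a="a", name_b="b"):
    """Return unified-diff lines between two page bodies; an empty list means no differences."""
    return list(
        difflib.unified_diff(
            a.splitlines(), b.splitlines(),
            fromfile=name_a, tofile=name_b, lineterm="",
        )
    )


# Hypothetical page bodies standing in for the two saved files.
page_root = "<html>\n<body>Hello</body>\n</html>"
page_default = "<html>\n<body>Hello</body>\n<!-- generated 12:00 -->\n</html>"

if content_digest(page_root) != content_digest(page_default):
    # The pages look identical in a browser but differ byte-wise.
    for line in show_differences(page_root, page_default, "root", "default.htm"):
        print(line)
```

Even a small difference such as a timestamp comment or a counter is enough to produce different digests.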


(I also have a similar problem with the external site as it carries session ids
around in the URL which change - although the content of the duplicate pages
is identical).

You can remove session IDs using URLNormalizers - see e.g. regex-urlnormalizer.xml for an example of how to do this.
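
To show the idea behind such a normalization rule, here is a minimal Python sketch. It is not Nutch configuration and not the contents of regex-urlnormalizer.xml; the parameter names in SESSION_PARAMS, the jsessionid path pattern, the function name strip_session_id, and the sample URLs are all assumptions chosen for illustration. The point is only that URLs differing solely in their session ID should map to one and the same URL.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed session parameter names; adjust to whatever the site actually uses.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid", "jsessionid"}

# Matches a ";jsessionid=..." path parameter, as appended by some servlet containers.
PATH_SESSION = re.compile(r";jsessionid=[^/?#;]*", re.IGNORECASE)


def strip_session_id(url):
    """Return the URL with session IDs removed from the path and the query string."""
    parts = urlsplit(url)
    path = PATH_SESSION.sub("", parts.path)
    # Keep every query parameter except the ones that carry a session ID.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    # Note: urlencode re-encodes the remaining values, so escaping may change slightly.
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(kept), parts.fragment))


# Two hypothetical URLs for the same page, differing only in session ID.
first = strip_session_id("http://website/page.jsp;jsessionid=ABC123?x=1&sid=42")
second = strip_session_id("http://website/page.jsp;jsessionid=XYZ789?x=1&sid=99")
print(first == second)
```

Once both URLs normalize to the same string, they are no longer fetched and stored as separate pages.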

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

