You may have more than two copies of the same resource, since URLs such as
http://www.abc.de/xyz/index.html will also return the same document.
Calculate a checksum for each URL retrieved and compare the checksums.
If one page is identical to another, the second can either be ignored
or stored as a pointer to the first.
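
A minimal sketch of that checksum comparison in Python, assuming the
fetched pages are available as (url, body) pairs in crawl order. The
function name find_duplicates, the choice of SHA-256, and the sample
page bodies are illustrative assumptions, not part of any particular
robot:

```python
import hashlib


def find_duplicates(pages):
    """Map each URL to the first URL seen with byte-identical content.

    `pages` is an iterable of (url, body_bytes) pairs in crawl order.
    A URL that maps to itself is the first copy; any other URL is a
    duplicate that points at that first copy.
    """
    first_seen = {}  # checksum -> first URL retrieved with that content
    canonical = {}   # url -> URL of the first identical page
    for url, body in pages:
        digest = hashlib.sha256(body).hexdigest()
        # setdefault stores `url` only if this checksum is new,
        # and always returns the URL recorded for the checksum.
        canonical[url] = first_seen.setdefault(digest, url)
    return canonical


# Hypothetical crawl results: three URLs serving the same bytes.
pages = [
    ("http://www.abc.de/xyz", b"<html>same</html>"),
    ("http://www.abc.de/xyz/", b"<html>same</html>"),
    ("http://www.abc.de/xyz/index.html", b"<html>same</html>"),
    ("http://www.abc.de/other", b"<html>different</html>"),
]
result = find_duplicates(pages)
```

Note that this only catches byte-identical pages. Documents that differ
in a timestamp or a counter will get different checksums even though
they are the same resource.
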
At 03:00 PM 11/21/01 +0100, Matthias Jaekle wrote:
>
>Hello,
>
>I read about adding a slash at the end of URLs if there is no
>absolute path present.
>
>But what about paths ending in subdirectories (xyz)?
>A link to http://www.abc.de/xyz/ might be more correct than a link
>to http://www.abc.de/xyz
>
>But is there a way to find out whether somebody who wrote
>http://www.abc.de/xyz meant http://www.abc.de/xyz/ ?
>
>In my database of scanned URLs I found both versions, so I believe I
>analysed many files twice.
>
>How do I handle this circumstance correctly?
>
>Many thanks for your help
>
>Matthias
>
>--
>This message was sent by the Internet robots and spiders discussion list
>([EMAIL PROTECTED]). For list server commands, send "help" in the body
>of a message to "[EMAIL PROTECTED]".