This is good advice.  More generally, the same document can have many
names, including mirrors whose URLs have completely different hosts and
domains.  My indexer keeps a mirror store that is indexed by a signature
made up of the document size, checksum, and term count.  This is
beneficial because, in addition to uniquely identifying documents on the
web, it also allows us to identify duplicated documents on the LAN.
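
To make the signature idea concrete, here is a minimal sketch in Python.
It is not the code from my indexer: the names signature() and MirrorStore
are made up for illustration, SHA-256 stands in for whatever checksum you
prefer, and "term count" is assumed to mean the number of
whitespace-separated tokens.

```python
import hashlib
from collections import defaultdict


def signature(body):
    """Return a (size, checksum, term_count) tuple for a document body (bytes)."""
    text = body.decode("utf-8", errors="replace")
    size = len(body)
    checksum = hashlib.sha256(body).hexdigest()  # assumed checksum algorithm
    term_count = len(text.split())               # assumed definition of "term"
    return (size, checksum, term_count)


class MirrorStore:
    """Hypothetical store that groups URLs by document signature."""

    def __init__(self):
        self._urls_by_signature = defaultdict(list)

    def add(self, url, body):
        """Record the URL and return the first URL seen with the same signature."""
        urls = self._urls_by_signature[signature(body)]
        if url not in urls:
            urls.append(url)
        return urls[0]

    def mirrors_of(self, body):
        """Return every URL recorded for a document with this body."""
        return list(self._urls_by_signature.get(signature(body), []))
```

Because the key depends only on the content, a copy fetched from a mirror
host or from a file server on the LAN ends up under the same entry as the
original.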

-----Original Message-----
From: Thomas Witt [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 21, 2001 1:06 PM
To: [EMAIL PROTECTED]
Subject: [Robots] Re: Correct URL, shlash at the end ?




You may have more than just two scans of the resource, as URLs such as
http://www.abc.de/xyz/index.html will also return the same document.

Calculate a checksum for each URL retrieved, and compare for identical
checksums.  If you find that one page is identical to another, the
second can either be ignored or stored as a pointer to the first.
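
A rough sketch of that checksum comparison in Python follows.  The function
name find_duplicates and the choice of SHA-256 are assumptions made for
illustration; any checksum with a low collision rate would do.

```python
import hashlib


def find_duplicates(pages):
    """Map each URL to None if it is the first copy, else to the first URL.

    `pages` is an iterable of (url, body_bytes) pairs in retrieval order.
    """
    first_url_by_checksum = {}
    pointers = {}
    for url, body in pages:
        checksum = hashlib.sha256(body).hexdigest()  # assumed checksum algorithm
        if checksum in first_url_by_checksum:
            # Identical content seen before: point at the first URL.
            pointers[url] = first_url_by_checksum[checksum]
        else:
            first_url_by_checksum[checksum] = url
            pointers[url] = None
    return pointers


if __name__ == "__main__":
    same = b"<html>index of xyz</html>"
    pages = [
        ("http://www.abc.de/xyz", same),
        ("http://www.abc.de/xyz/", same),
        ("http://www.abc.de/xyz/index.html", same),
    ]
    for url, canonical in find_duplicates(pages).items():
        print(url, "->", canonical or "(first copy)")
```

Whether you drop the later URLs or keep them as pointers depends on whether
you still want to resolve links that use the alternate spellings.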

At 03:00 PM 11/21/01 +0100, Matthias Jaekle wrote:
>
>Hello,
>
>I read about adding a slash at the end of URLs if there is no
>absolute path present.
>
>But what about paths ending in subdirectories (xyz)?
>A link to http://www.abc.de/xyz/ might be more correct than a link
>to http://www.abc.de/xyz
>
>But is there a way to find out whether somebody who wrote
>http://www.abc.de/xyz meant http://www.abc.de/xyz/ ?
>
>In my database of scanned URLs I found both versions, so I believe I
>analysed many files twice.
>
>How do I handle this situation correctly?
>
>Many thanks for your help
>
>Matthias
>
>
>
>
>--
>This message was sent by the Internet robots and spiders discussion list
>([EMAIL PROTECTED]).  For list server commands, send "help" in the body
>of a message to "[EMAIL PROTECTED]".
>
