This is good advice. More generally, the same document can go by many names, including mirrors whose URLs have completely different hosts and domains. My indexer keeps a mirror store keyed by a signature made up of the document size, checksum, and term count. Besides identifying duplicate documents on the web, this also lets us identify duplicated documents on the LAN.
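A minimal sketch of a signature-keyed store along these lines, in Python. It is an illustration only, not the indexer's actual code: the names `signature` and `MirrorStore` are hypothetical, SHA-256 stands in for whatever checksum is really used, and a "term" is assumed to be a whitespace-separated token.

```python
import hashlib
from typing import Dict, List, Tuple

# (size in bytes, checksum, term count)
Signature = Tuple[int, str, int]


def signature(body: bytes) -> Signature:
    """Build a (size, checksum, term count) signature for a document body."""
    # Assumption: SHA-256 as the checksum; any stable digest would do.
    checksum = hashlib.sha256(body).hexdigest()
    # Assumption: a term is a whitespace-separated token.
    term_count = len(body.split())
    return (len(body), checksum, term_count)


class MirrorStore:
    """Maps each document signature to every URL that returned that document."""

    def __init__(self) -> None:
        self._urls_by_signature: Dict[Signature, List[str]] = {}

    def add(self, url: str, body: bytes) -> str:
        """Record url; return the first URL seen with the same signature."""
        urls = self._urls_by_signature.setdefault(signature(body), [])
        if url not in urls:
            urls.append(url)
        return urls[0]

    def mirrors(self, body: bytes) -> List[str]:
        """Return all URLs recorded so far for this document body."""
        return list(self._urls_by_signature.get(signature(body), []))


store = MirrorStore()
page = b"<html>Welcome to xyz</html>"
for url in (
    "http://www.abc.de/xyz",
    "http://www.abc.de/xyz/",
    "http://www.abc.de/xyz/index.html",
):
    canonical = store.add(url, page)
    if canonical != url:
        print(url, "duplicates", canonical)
```

Because the key is derived only from the content, the three URL spellings from this thread collapse to one entry without any URL rewriting, and a copy on a different host would land in the same entry.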
-----Original Message-----
From: Thomas Witt [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, November 21, 2001 1:06 PM
To: [EMAIL PROTECTED]
Subject: [Robots] Re: Correct URL, shlash at the end ?

You may have more than just two scans of the resource, as URLs such as http://www.abc.de/xyz/index.html will also return the same document.

Calculate a checksum of the document retrieved for each URL, and compare for identical checksums. If you find that one page is identical to another, the second can either be ignored or stored as a pointer to the first.

At 03:00 PM 11/21/01 +0100, Matthias Jaekle wrote:
>
>Hello,
>
>I read about adding a slash at the end of URLs if there is no
>absolute path present.
>
>But what about paths ending in subdirectories (xyz)?
>A link to http://www.abc.de/xyz/ might be more correct than the link
>to http://www.abc.de/xyz
>
>But is there a way to find out whether somebody who wrote
>http://www.abc.de/xyz means http://www.abc.de/xyz/ ?
>
>In my database of scanned URLs I found both versions, so I believe I
>analysed many files twice.
>
>How do I handle this circumstance correctly?
>
>Many thanks for your help
>
>Matthias
>
>--
>This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to "[EMAIL PROTECTED]".