> From [EMAIL PROTECTED] Wed Mar 8 15:32:57 2000
> MIME-Version: 1.0
> Date: Wed, 8 Mar 2000 11:59:40 -0800
> From: Mark Bennett <[EMAIL PROTECTED]>
> Subject: Re: What happens once robots are barred?
> Comments: To: Internet robots discussion <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
>
> As you point out, the behavior is up to the spider's implementation.
>
> What a good spider "ought" to do (in my opinion) is the following:
> * It should check the robots file each time. Since it's unlikely that this
> file changes every few minutes, IMHO, fetching the robots file only every
> few pages is less wear and tear on the server.
> * It should look for the meta tags each time.
> * It should also keep track of "orphan" pages - pages that are still
> accessible via the direct URL, but are no longer linked-to by other pages on
> the site.

I have worked for a few content providers in the past. We may have a web
site with "official" content and a few "orphaned" directories for those in
the "know" who are more technically oriented, rather like anonymous FTP.
Some of these are of a more personal nature. Other sites may link to them,
or they may not. I don't think that there are any rules.

> I believe all 3 classes of pages should be removed from the index.
>
> The third item is an interesting one. I know some spiders do NOT realize
> that pages are no longer "linked to" and keep indexing them.
>
> We decided this was bad. Our assumption (when in doubt, be conservative) is
> that many webmasters intentionally "unlink" sections of their web site; they
> may be in the process of updating the pages, or perhaps that part of the
> site is now obsolete. So a new spider or web surfer will never see the
> pages.
>
> Yes, it's possible a webmaster might accidentally unlink part of their
> site. Or perhaps an intervening linking page had retrieval errors. In
> those rare cases, a "dumb" spider would be preferred - one that indexes the
> orphan pages anyway. Our spider has retry code, so the "intervening page
> error" incidents should be reduced.
>
> You could argue it either way. Again, we err on the side of being extra
> careful not to include content that a webmaster doesn't want published.
>
> I'm curious: we've never really gathered opinions from anybody outside the
> company on this somewhat obscure bit of spider design. Any comments from
> you all?
>
> Mark
>
> ----------------------------------------------------------------------------
> Mark L. Bennett
> CTO / Searchbutton.com, Inc.
> [EMAIL PROTECTED]
> (650) 947-8312
>
> Search-enable your website today with Searchbutton.com!
>
> -----Original Message-----
> From: Thomas Witt [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 08, 2000 5:15 AM
> To: [EMAIL PROTECTED]
> Subject: Re: What happens once robots are barred?
>
> Robots can't be "barred" or banned from a site - whether a bot observes any
> robots.txt or META directives is entirely at the discretion of the bot
> operators or developers.
>
> At 10:50 AM 3/8/00 -0000, Brian Kelly wrote:
> >What do robots do if they index a web site and the resources they index
> >are subsequently made unavailable to robots through use of the
> >robots.txt file or the META ROBOTS tag? Will they remain in the index,
> >even though the content may have changed (which spammers could exploit)
> >or would they be deleted once the robot exclusion was detected?
> >
> >Thanks
> >
> >Brian Kelly
> >
> >--------------------------------------------------------------------
> >Brian Kelly, UK Web Focus
> >UKOLN, University of Bath, BATH, England, BA2 7AY
> >Email: [EMAIL PROTECTED]  URL: http://www.ukoln.ac.uk/
> >Homepage: http://www.ukoln.ac.uk/ukoln/staff/b.kelly.html
> >Phone: 01225 323943  FAX: 01225 826838
> >
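[Archive note: the three checks Mark describes - re-evaluating robots.txt, honoring the META robots tag, and pruning "orphan" pages that are no longer linked-to - can be sketched with the modern Python standard library. This is purely illustrative, not Searchbutton.com's actual spider; the function names (`allowed_by_robots`, `meta_noindex`, `orphans`) are made up for this sketch, and `urllib.robotparser` postdates the practices described in this thread.]

```python
import urllib.robotparser
from html.parser import HTMLParser


def allowed_by_robots(robots_txt: str, agent: str, url: str) -> bool:
    """Check a (possibly re-fetched and cached) robots.txt body for a URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)


class _MetaRobots(HTMLParser):
    """Scans a page for a <meta name="robots"> tag containing 'noindex'."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag.lower() != "meta":
            return
        d = {k.lower(): (v or "") for k, v in attrs}
        if d.get("name", "").lower() == "robots" and "noindex" in d.get("content", "").lower():
            self.noindex = True


def meta_noindex(html_text: str) -> bool:
    parser = _MetaRobots()
    parser.feed(html_text)
    return parser.noindex


def orphans(indexed_urls: set, linked_urls: set) -> set:
    """Pages still in the index but no longer linked-to anywhere on the site.

    A conservative spider (per Mark's policy) drops these from its index,
    on the assumption that the webmaster unlinked them deliberately.
    """
    return indexed_urls - linked_urls
```

On the next crawl, a conservative spider would drop any URL that fails the robots.txt check, carries a noindex meta tag, or falls into the orphan set - the "3 classes of pages" in the discussion above.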
