As you point out, the behavior is up to the spider's implementation. What a good spider "ought" to do (in my opinion) is the following:

* It should check the robots.txt file on each crawl.
* It should look for the robots meta tags on each crawl.
* It should also keep track of "orphan" pages - pages that are still accessible via the direct URL, but are no longer linked-to by other pages on the site.
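The three checks above could be sketched roughly like this. This is a hypothetical illustration, not our actual crawler code; the function names are made up, the meta-tag check is a deliberately naive string match, and it uses Python's standard `urllib.robotparser` module:

```python
# Illustrative sketch only; names and structure are hypothetical.
import urllib.robotparser

def allowed_by_robots(robots_url, user_agent, page_url):
    """Re-read robots.txt on each crawl rather than trusting a stale copy."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # fetches robots.txt over the network
    return rp.can_fetch(user_agent, page_url)

def allowed_by_meta(html):
    """Naive check for a <meta name="robots" content="noindex"> directive.
    A real spider would parse the HTML properly."""
    lowered = html.lower()
    return not ('name="robots"' in lowered and "noindex" in lowered)

def find_orphans(indexed_urls, linked_urls):
    """Pages still in the index but no longer linked-to anywhere on the site.
    linked_urls is the set of URLs discovered during the current crawl."""
    return set(indexed_urls) - set(linked_urls)
```

A spider following the policy described here would drop a page from the index when any of the three checks fails: `allowed_by_robots` returns False, `allowed_by_meta` returns False, or the page shows up in `find_orphans`.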
I believe all three classes of pages should be removed from the index.

The third item is an interesting one. I know some spiders do NOT realize that pages are no longer linked-to and keep indexing them. We decided this was bad. Our assumption (when in doubt, be conservative) is that many webmasters intentionally "unlink" sections of their web site; they may be in the process of updating the pages, or perhaps that part of the site is now obsolete. Either way, a new spider or web surfer will never see the pages.

Yes, it's possible a webmaster might accidentally unlink part of their site, or an intervening linking page may have had retrieval errors. In those rare cases, a "dumb" spider would be preferred - one that indexes the orphan pages anyway. Our spider has retry code, so the "intervening page error" incidents should be reduced. You could argue it either way. Again, we err on the side of being extra careful not to include content that a webmaster doesn't want published.

I'm curious - we've never really gathered opinions from anyone outside the company on this somewhat obscure bit of spider design. Any comments from you all?

Mark

---------------------------------------------------------------------------
Mark L. Bennett
CTO / Searchbutton.com, Inc.
[EMAIL PROTECTED]
(650) 947-8312
Search-enable your website today with Searchbutton.com!

-----Original Message-----
From: Thomas Witt [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 08, 2000 5:15 AM
To: [EMAIL PROTECTED]
Subject: Re: What happens once robots are barred?

Robots can't be "barred" or banned from a site - whether a bot observes any robots.txt or META directives is entirely at the discretion of the bot operators or developers.

At 10:50 AM 3/8/00 -0000, Brian Kelly wrote:
>What do robots do if they index a web site and the resources they index
>are subsequently made unavailable to robots through use of the
>robots.txt file or the META ROBOTS tag?
>Will they remain in the index,
>even though the content may have changed (which spammers could exploit)
>or would they be deleted once the robot exclusion was detected?
>
>Thanks
>
>Brian Kelly
>
>--------------------------------------------------------------------
>Brian Kelly, UK Web Focus
>UKOLN, University of Bath, BATH, England, BA2 7AY
>Email: [EMAIL PROTECTED]  URL: http://www.ukoln.ac.uk/
>Homepage: http://www.ukoln.ac.uk/ukoln/staff/b.kelly.html
>Phone: 01225 323943  FAX: 01225 826838