As you point out, the behavior is up to the spider's implementation.

What a good spider "ought" to do (in my opinion) is the following:
* It should re-check the robots.txt file on every visit to the site.
* It should re-check the robots META tags on every page it fetches.
* It should also keep track of "orphan" pages - pages that are still
accessible via their direct URLs, but are no longer linked to by any other
page on the site.

I believe all 3 classes of pages should be removed from the index.
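To make the first two checks concrete, here is a rough sketch in Python of
the kind of re-check-and-prune pass I have in mind.  This is illustrative
only, not our production code: the "index" is just a URL-to-page dictionary
and "ExampleSpider" is a made-up user-agent string.  (The orphan check is
sketched separately further down.)

    import re
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    # Matches a <meta name="robots" content="...noindex..."> tag
    # (simplified; a real HTML parser would be more forgiving).
    META_NOINDEX = re.compile(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
        re.IGNORECASE)

    def recheck_and_prune(index, user_agent="ExampleSpider"):
        """index: dict mapping URL -> stored page text (placeholder structure)."""
        robots = {}  # one robots.txt parser per host, fetched fresh this run
        for url in list(index):
            host = "{0.scheme}://{0.netloc}".format(urlparse(url))
            if host not in robots:
                rp = urllib.robotparser.RobotFileParser(urljoin(host, "/robots.txt"))
                try:
                    rp.read()              # re-read robots.txt on every crawl
                except OSError:
                    rp = None              # robots.txt unreachable; don't prune this host
                robots[host] = rp
            rp = robots[host]
            # Check 1: drop pages the site now disallows in robots.txt.
            if rp is not None and not rp.can_fetch(user_agent, url):
                del index[url]
                continue
            # Check 2: drop pages that now carry a robots NOINDEX meta tag.
            try:
                page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                   # fetch failed; leave it for the retry logic
            if META_NOINDEX.search(page):
                del index[url]
            else:
                index[url] = page          # still allowed; refresh the stored copy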

The third item is an interesting one.  I know some spiders do NOT realize
that pages are no longer "linked to" and keep indexing them.

We decided this was bad.  Our assumption (when in doubt, be conservative) is
that many webmasters intentionally "unlink" sections of their web site; they
may be in the process of updating the pages, or perhaps that part of the
site is now obsolete.  So a new spider, or a web surfer following links,
would never see those pages - and neither, we think, should our index.
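In code terms, the orphan check boils down to a set difference between what
is in the index and what the current crawl could reach by following links.
A minimal sketch (again Python, again illustrative - the index structure and
the linked_urls set are assumptions, not our real data structures):

    def prune_orphans(index, linked_urls):
        """Drop index entries that no link reached on the current crawl.

        index       -- dict mapping URL -> stored document (placeholder)
        linked_urls -- set of URLs discovered by following links from the
                       site's entry points during this crawl
        """
        orphans = set(index) - set(linked_urls)
        for url in orphans:
            # The page may still answer at its direct URL, but nothing on
            # the site points at it any more, so treat it as unpublished.
            del index[url]
        return orphans

The point is that "orphan" is defined by link reachability on the current
crawl, not by whether the URL still returns a page.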

Yes, it's possible a webmaster might accidentally unlink part of their
site.  Or perhaps an intervening linking page had retrieval errors.  In
those rare cases a "dumb" spider - one that indexes the orphan pages
anyway - would actually serve the site better.  Our spider has retry code,
so the "intervening page error" incidents should be rare.

You could argue it either way.  Again, we err on the side of being extra
careful not to include content that a webmaster doesn't want published.

I'm curious: we've never really gathered opinions from anybody outside the
company on this somewhat obscure bit of spider design.  Any comments from
you all?

Mark

---------------------------------------------------------------------------
Mark L. Bennett
CTO / Searchbutton.com, Inc.
[EMAIL PROTECTED]
(650) 947-8312

Search-enable your website today with Searchbutton.com!

-----Original Message-----
From: Thomas Witt [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, March 08, 2000 5:15 AM
To: [EMAIL PROTECTED]
Subject: Re: What happens once robots are barred?

Robots can't be "barred" or banned from a site - whether a bot observes any
robots.txt or META directives is entirely at the discretion of the bot
operators or developers.

At 10:50 AM 3/8/00 -0000, Brian Kelly wrote:
>What do robots do if they index a web site and the resources they index
>are subsequently made unavailable to robots through use of the
>robots.txt file or the META ROBOTS tag?  Will they remain in the index,
>even though the content may have changed (which spammers could exploit)
>or would they be deleted once the robot exclusion was detected?
>
>Thanks
>
>Brian Kelly
>
>--------------------------------------------------------------------
>Brian Kelly, UK Web Focus
>UKOLN, University of Bath, BATH, England, BA2 7AY
>Email:  [EMAIL PROTECTED]     URL:    http://www.ukoln.ac.uk/
>Homepage: http://www.ukoln.ac.uk/ukoln/staff/b.kelly.html
>Phone:  01225 323943            FAX:   01225 826838
>
