> From [EMAIL PROTECTED] Wed Mar  8 15:32:57 2000
> MIME-Version: 1.0
> Date:         Wed, 8 Mar 2000 11:59:40 -0800
> From: Mark Bennett <[EMAIL PROTECTED]>
> Subject:      Re: What happens once robots are barred?
> Comments: To: Internet robots discussion <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
>
> As you point out, the behavior is up to the spider's implementation.
>
> What a good spider "ought" to do (in my opinion) is the following:
> * It should check the robots file each time.
Since this file is unlikely to change every few minutes, IMHO, fetching the
robots file only once every few pages means less wear and tear on the server.
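A simple way to get that effect (purely a sketch; the TTL value and the injected fetch hook are made up here, not anything from an existing spider) is to cache robots.txt per host and only refetch once the cached copy goes stale:

```python
import time

class RobotsCache:
    """Cache robots.txt per host so a crawl refetches it only after a
    TTL expires, instead of before every single page (hypothetical sketch)."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.time):
        self.fetch = fetch        # callable: host -> robots.txt body
        self.ttl = ttl_seconds
        self.clock = clock        # injectable for testing
        self._cache = {}          # host -> (fetched_at, body)

    def get(self, host):
        now = self.clock()
        entry = self._cache.get(host)
        if entry is None or now - entry[0] > self.ttl:
            body = self.fetch(host)
            self._cache[host] = (now, body)
            return body
        return entry[1]
```

With a one-hour TTL, a spider pulling a few pages a minute from one host hits robots.txt once an hour instead of hundreds of times.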

> * It should look for the meta tags each time.
> * It should also keep track of "orphan" pages - pages that are still
> accessible via the direct URL, but are no longer linked-to by other pages on
> the site.
I have worked for a few content providers in the past. We might have a web site
with "official" content and a few "orphaned" directories for those in
the "know" who are more technically oriented, rather like anonymous
FTP. Some of these are of a more personal nature. Other sites
may link to them, or they may not. I don't think there are any rules.

>
> I believe all 3 classes of pages should be removed from the index.
>
> The third item is an interesting one.  I know some spiders do NOT realize
> that pages are no longer "linked to" and keep indexing them.
>
> We decided this was bad.  Our assumption (when in doubt, be conservative) is
> that many webmasters intentionally "unlink" sections of their web site; they
> may be in the process of updating the pages, or perhaps that part of the
> site is now obsolete.  So a new spider or web surfer will never see the
> pages.
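For what it's worth, the orphan check Mark describes boils down to a set difference between what's in the index and what's still reachable by following links from the site's entry points. A rough sketch (the function name and inputs are hypothetical, not anything from Mark's spider):

```python
from collections import deque

def find_orphans(indexed, links, seeds):
    """Return indexed pages no longer reachable by following links
    from the site's seed pages (hypothetical sketch).

    indexed: set of URLs currently in the index
    links:   dict mapping URL -> list of outgoing link URLs
    seeds:   starting URLs (e.g. the site's home page)
    """
    reachable = set(seeds)
    queue = deque(seeds)
    while queue:                    # breadth-first walk of the link graph
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    return indexed - reachable      # indexed but unreachable = orphans
```

Whether you then drop the orphans (Mark's conservative choice) or keep indexing them (the "dumb" spider) is the policy question being debated here.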
>
> Yes, it's possible a web master might accidentally unlink part of their
> site.  Or perhaps an intervening linking page had retrieval errors.  In
> those rare cases, a "dumb" spider would be preferred - one that indexes the
> orphan pages anyway.  Our spider has retry code so the "intervening page
> error" incidents should be reduced.
>
> You could argue it either way.  Again, we err on the side of being extra
> careful to not include content that a webmaster doesn't want published.
>
> I'm curious, we've never really gathered opinions from anybody outside the
> company on this somewhat obscure bit of spider design.  Any comments from
> you all?
>
> Mark
>
> ---------------------------------------------------------------------------------------------------
> Mark L. Bennett
> CTO / Searchbutton.com, Inc.
> [EMAIL PROTECTED]
> (650) 947-8312
>
> Search-enable your website today with Searchbutton.com!
>
> -----Original Message-----
> From: Thomas Witt [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, March 08, 2000 5:15 AM
> To: [EMAIL PROTECTED]
> Subject: Re: What happens once robots are barred?
>
> Robots can't be "barred" or banned from a site - whether a bot observes any
> robots.txt or META directives is entirely at the discretion of the bot
> operators or developers.
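Right - observing the file is strictly opt-in. In Python, for instance, a cooperative bot has to parse robots.txt itself and consult it before every fetch; nothing stops a bot that skips this step (sketch using the standard library's robots.txt parser; the bot name and paths are made up):

```python
from urllib.robotparser import RobotFileParser

# A well-behaved bot voluntarily parses robots.txt and checks each URL
# against it before fetching. A rude bot simply never does this.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(parser.can_fetch("MyBot/1.0", "http://example.com/public/page.html"))   # True
print(parser.can_fetch("MyBot/1.0", "http://example.com/private/page.html"))  # False
```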
>
> At 10:50 AM 3/8/00 -0000, Brian Kelly wrote:
> >What do robots do if they index a web site and the resources they index
> >are subsequently made unavailable to robots through use of the
> >robots.txt file or the META ROBOTS tag?  Will they remain in the index,
> >even though the content may have changed (which spammers could exploit)
> >or would they be deleted once the robot exclusion was detected?
> >
> >Thanks
> >
> >Brian Kelly
> >
> >--------------------------------------------------------------------
> >Brian Kelly, UK Web Focus
> >UKOLN, University of Bath, BATH, England, BA2 7AY
> >Email:  [EMAIL PROTECTED]     URL:    http://www.ukoln.ac.uk/
> >Homepage: http://www.ukoln.ac.uk/ukoln/staff/b.kelly.html
> >Phone:  01225 323943            FAX:   01225 826838
> >
>
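To Brian's original question: a spider that re-reads robots.txt on each crawl could purge newly excluded URLs with a prune pass like this (hypothetical sketch; `can_fetch` stands in for an actual robots.txt or META ROBOTS check):

```python
def prune_excluded(index, can_fetch):
    """Drop indexed URLs that a recrawl finds newly disallowed (sketch).

    index:     dict mapping URL -> indexed document
    can_fetch: callable URL -> bool, False if robots rules now exclude it
    """
    for url in list(index):        # list() so we can delete while iterating
        if not can_fetch(url):
            del index[url]
    return index
```

A spider without such a pass leaves the stale entries in the index, which is exactly the spammer-exploitable behavior Brian worries about.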
