According to [EMAIL PROTECTED]:
> At 1:01 PM +0200 8/14/02, pp wrote:
> >I got robots.txt like this:
> >
> >User-agent: *
> >Disallow: /page
> >
> >This should stop all robots from indexing any page within /page.
> >Right?
> 
> Nope, you should disallow an entire directory with a slash at the 
> end, like this: /page/

Actually, the trailing slash isn't needed, unless it's to avoid also
disallowing something like /page.html.  The standard essentially describes
a simple prefix match (any URL that starts with the Disallow value is
excluded), which is what htdig implements.

See http://www.robotstxt.org/wc/norobots.html#format
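
To make the effect of the trailing slash concrete, here is a minimal Python
sketch. It assumes each Disallow value is compared against the start of the
URL path, following the standard's wording ("any URL that starts with this
value will not be retrieved"). It is only an illustration, not htdig's actual
code, and the function name is_disallowed is hypothetical.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value.

    Assumption: plain prefix comparison, no wildcards. An empty
    Disallow value excludes nothing, so it is skipped.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    paths = ("/page/index.html", "/page.html", "/other.html")
    for rule in ("/page", "/page/"):
        for path in paths:
            # Print whether each path is excluded under this single rule.
            print(f"Disallow: {rule:<7} {path:<17} "
                  f"excluded={is_disallowed(path, [rule])}")
```

Under this reading, "Disallow: /page" would exclude both /page/index.html and
/page.html, while "Disallow: /page/" would exclude only URLs below the
directory.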

Geoff's explanation, that the URLs were already in the database before
being disallowed in robots.txt, seems like the most plausible one.
We haven't heard back from Piotras as to whether reindexing from scratch
fixed the problem.

> Check your robots formatting using a checker:
> 
> http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
> http://www.tardis.ed.ac.uk/%7Esxw/robots/check/
> http://www.ukoln.ac.uk/web-focus/webwatch/services/robots-txt/
> 
> More information at
> 
> <http://www.robotstxt.org/wc/exclusion-admin.html>
> <http://www.searchtools.com/robots/robots-txt.html>

Good to know, thanks!

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
