According to [EMAIL PROTECTED]:
> At 1:01 PM +0200 8/14/02, pp wrote:
> >I got robots.txt like this:
> >
> >User-agent: *
> >Disallow: /page
> >
> >This should disallow all robots index all pages within /page.
> >Right?
>
> Nope, you should disallow an entire directory with a slash at the
> end, like this: /page/

Actually, the trailing slash isn't needed, unless it's to avoid also
disallowing something like /page.html.  The standard essentially describes
a simple substring match, which is what htdig implements.  See

    http://www.robotstxt.org/wc/norobots.html#format

I think Geoff's explanation, about URLs already being in the database
before being disallowed in robots.txt, seems like the most plausible one.
We haven't heard back from Piotras as to whether reindexing from scratch
fixed the problem.

> Check your robots formatting using a checker:
>
> http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
> http://www.tardis.ed.ac.uk/%7Esxw/robots/check/
> http://www.ukoln.ac.uk/web-focus/webwatch/services/robots-txt/
>
> More information at
>
> <http://www.robotstxt.org/wc/exclusion-admin.html>
> <http://www.searchtools.com/robots/robots-txt.html>

Good to know, thanks!

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
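
To make the difference between "Disallow: /page" and "Disallow: /page/"
concrete, here is a minimal Python sketch of the matching rule. The
standard's wording is that any URL starting with the Disallow value is not
to be retrieved, so the sketch uses a plain prefix test. The function name
is_disallowed and the sample paths are made up for illustration; this is
not htdig's code, only a model of the rule as the standard states it.

```python
def is_disallowed(path, disallow_rules):
    """Return True if path starts with any non-empty Disallow value.

    Hypothetical helper: models the rule as worded in the robots.txt
    standard, not any particular crawler's implementation.
    """
    # An empty Disallow value means "allow everything", so skip it.
    return any(rule and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    paths = ["/page/index.html", "/page.html", "/pages/a.html", "/other.html"]
    for rule in ("/page", "/page/"):
        for path in paths:
            print(f"Disallow: {rule:<7} {path:<18} blocked={is_disallowed(path, [rule])}")
```

Under this reading, "/page" already covers everything beneath /page/, and
the trailing slash only matters if /page.html or /pages/... should stay
indexable.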

