As the original robots.txt standard is 7 years old, I agree it is time
for a review, and maybe for extending the functionality of the
robots.txt file.
Maybe something along the lines of the following additions:
Version: [number]
    1 - original robots.txt specification syntax
    2 - future extended robots.txt specification
    blank - falls back to version
    Suggested in case, for example, Version 2 extends the already
    existing Allow: line
Interval: [number] [interval]
    [number] - numerical value
    [interval] - h[ours], d[ays], w[eeks], m[onths], and y[ears].
    As suggested by Fred Atkinson
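To make the Interval idea a bit more concrete, here is a rough Python sketch of how a crawler might read such a value. Nothing here is part of any standard: the function name parse_interval is made up for illustration, and the lengths of a month (30 days) and a year (365 days) are my own assumptions, since the proposal does not define them.

```python
import re
from datetime import timedelta

# Hours per unit letter. Month = 30 days and year = 365 days are
# assumptions; the proposal does not say how long they are.
_UNIT_HOURS = {"h": 1, "d": 24, "w": 168, "m": 720, "y": 8760}


def parse_interval(value):
    """Turn a value such as '30d' or '12 h' into a timedelta."""
    match = re.fullmatch(r"(\d+)\s*([hdwmy])", value.strip().lower())
    if match is None:
        raise ValueError("unrecognised Interval value: %r" % value)
    number = int(match.group(1))
    unit = match.group(2)
    return timedelta(hours=number * _UNIT_HOURS[unit])
```

The sketch accepts the value with or without a space between the number and the unit letter, so both "30d" and "30 d" would work.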
AllowTypes: [mimetypes]
    A list of MIME types the crawler is allowed to retrieve. We are no
    longer indexing only text/html pages, but also PDF, MS Word, etc.
    e.g.
    # Allow only the following document types
    AllowTypes: text/html, text/xml, text/plain, image/jpeg
BlockTypes: [mimetypes]
    A list of MIME types the crawler is not allowed to retrieve.
    e.g.
    # Do not index PDF files and MS Word files. All others are
    # allowed.
    BlockTypes: application/pdf, application/msword
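As a sketch of how a crawler might apply the two type lists, something like the following Python could work. The function names are hypothetical, and two things are my own assumptions rather than part of the proposal: a BlockTypes list is taken to mean types the crawler must not retrieve, and a blocked type wins if it appears in both lists.

```python
def parse_type_list(value):
    """Split 'text/html, text/xml' into a set of lower-case MIME types."""
    return {item.strip().lower() for item in value.split(",") if item.strip()}


def may_retrieve(content_type, allow_types=None, block_types=None):
    """Decide whether a response with this Content-Type may be fetched.

    allow_types / block_types are sets from parse_type_list, or None
    when the corresponding line is absent from robots.txt.
    """
    # Drop parameters such as '; charset=iso-8859-1'.
    mime = content_type.split(";", 1)[0].strip().lower()
    if block_types and mime in block_types:
        return False  # assumption: BlockTypes takes precedence
    if allow_types is not None:
        return mime in allow_types  # only listed types are allowed
    return True  # no AllowTypes line: everything not blocked is allowed
```

A crawler would most likely have to issue the request (or a HEAD request) first, since the MIME type is only known from the server's response headers.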
AllowExtension: [extensionlist]
    A list of filename extensions we should include (maybe MIME types
    have not been configured correctly on the server, or we use .exe
    files to display HTML pages with CGI, etc.)
    e.g.
    # Allow the following extensions to be indexed.
    AllowExtension: .html, .php, .pl
BlockExtension: [extensionlist]
    A list of filename extensions we should exclude
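The extension lists could be checked before a URL is even requested. Here is a minimal Python sketch, again with made-up function names. It assumes that a blocked extension wins over an allowed one, and that a URL with no extension (a directory index, say) is excluded whenever an AllowExtension line is present. The proposal does not settle either point.

```python
import posixpath
from urllib.parse import urlsplit


def parse_extension_list(value):
    """Split '.html, .php, .pl' into a set of lower-case extensions."""
    return {item.strip().lower() for item in value.split(",") if item.strip()}


def extension_permitted(url, allow_ext=None, block_ext=None):
    """Decide from the URL path alone whether the crawler may index it."""
    path = urlsplit(url).path  # ignores the query string and fragment
    ext = posixpath.splitext(path)[1].lower()  # '' when there is none
    if block_ext and ext in block_ext:
        return False  # assumption: BlockExtension takes precedence
    if allow_ext is not None:
        return ext in allow_ext
    return True
```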
The possibilities are endless, but each would give site owners better
control over how their sites are indexed.
/PT
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Fred Atkinson
Sent: 10 January 2004 17:52
To: Robots
Subject: [Robots] Robots.txt Evolution?
Hi,
I've just subscribed to the robots.txt list. I read the messages
posted through December.
My question is whether there is going to be any evolution of the
robots.txt coding. It is very limited at present. I can think of a few
things they could incorporate that would make it better, and I'm sure
the rest of you could, too.
I've had robots.txt files on my sites for years. When I recently
researched to see what had changed, it didn't appear that there was
anything new on the horizon.
I've got two robots completely blocked out of my system. One is
Scooter, which is Alta Vista's robot. When I initially put it in as
disallowed, it was because Scooter was hitting my site several times a
day.
I don't mind them scanning me once in a while to get listings for
search engines, but I do object to them hammering my site that
frequently.
Shouldn't there be coding to tell either a particular robot or a
group of robots how often they are allowed to scan my site? Maybe a
line like (and this is arbitrary):
User-agent: Scooter
Interval: 30d
Disallow: /whatever you want them not to scan when they do come in.
This would tell Scooter that he is not to scan again until thirty
days after the last scan. There could be codes like h[ours], d[ays],
w[eeks], m[onths], and y[ears].
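A robot honouring a line like that might do something along these lines. This is only a Python sketch under several assumptions: the function names are made up, records are taken to be separated by blank lines as in the current robots.txt format, the robot name is matched exactly (ignoring case), and a month is counted as 30 days and a year as 365.

```python
import re
from datetime import timedelta

# Assumed unit lengths in hours (month = 30 days, year = 365 days).
_UNIT_HOURS = {"h": 1, "d": 24, "w": 168, "m": 720, "y": 8760}


def interval_for(robots_txt, agent):
    """Return the Interval that applies to `agent`, or None if absent."""
    agents = []  # User-agent names of the record being read
    for raw in robots_txt.splitlines():
        if not raw.strip():
            agents = []  # a blank line ends the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # comment-only or malformed line
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            agents.append(value.lower())
        elif field.lower() == "interval" and agent.lower() in agents:
            match = re.fullmatch(r"(\d+)\s*([hdwmy])", value.lower())
            if match:
                hours = int(match.group(1)) * _UNIT_HOURS[match.group(2)]
                return timedelta(hours=hours)
    return None


def may_scan(robots_txt, agent, last_scan, now):
    """True if enough time has passed since the robot's last scan."""
    interval = interval_for(robots_txt, agent)
    return interval is None or now - last_scan >= interval
```

The robot would have to remember when it last scanned each site, which is its own responsibility. The robots.txt file only states the wish.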
Just an idea I had.
Feedback?
Fred
_______________________________________________
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots