As the original robots.txt standard is 7 years old, I agree it is time
for a review, and perhaps for extending the functionality of the
robots.txt file.

Maybe something along the lines of the following additions:

        Version: [number]
                1       - original robots.txt specification syntax
                2       - future extended robots.txt specification
                blank   - falls back to version 1

                Suggested in case, for example, Version 2 extends the
                already existing Allow: line.

        Interval: [number] [interval]
                [number]        - numerical value
                [interval]      - h[ours], d[ays], w[eeks], m[onths],
                                  or y[ears]

                As suggested by Fred Atkinson

        AllowTypes: [mimetypes]
                A list of MIME types the crawler is allowed to retrieve.
                We are no longer indexing only text/html pages, but also
                PDF, MS Word, and other document types.

                e.g. 
        
                        # Allow only the following document types
                        AllowTypes: text/html, text/xml, text/plain, image/jpeg

        BlockTypes: [mimetypes]
                A list of MIME types the crawler is not allowed to
                retrieve.

                e.g. 

                        # Do not index PDF files and MS Word files; all
                        # others are allowed.
                        BlockTypes: application/pdf, application/msword

        AllowExtension: [extensionlist]
                A list of filename extensions we should include (perhaps
                MIME types have not been configured correctly on the
                server, or we use .exe files to serve HTML pages via
                CGI, etc.)

                e.g.

                        # Allow the following extensions to be indexed.
                        AllowExtension: .html, .php, .pl

        BlockExtension: [extensionlist]
                A list of filename extensions we should exclude.

The possibilities are endless, but each would give site owners better
control over how their sites are indexed.
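To make the proposal concrete, here is a minimal sketch of how a crawler
might evaluate the suggested AllowTypes / BlockTypes / AllowExtension /
BlockExtension records. All four directive names are hypothetical (they
are not part of the existing robots.txt standard), and the precedence
chosen here (block lists win; a non-empty allow list is exclusive) is an
assumption, not part of the proposal:

```python
# Sketch of evaluating the proposed (hypothetical) robots.txt
# extensions: AllowTypes, BlockTypes, AllowExtension, BlockExtension.
# Precedence is assumed: block lists win, allow lists are exclusive.
import os

def parse_extended_records(text):
    """Parse 'Field: v1, v2' lines into a dict of lowercase value lists."""
    rules = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop comments
        if not line or ':' not in line:
            continue
        field, _, values = line.partition(':')
        rules[field.strip().lower()] = [
            v.strip().lower() for v in values.split(',') if v.strip()
        ]
    return rules

def may_fetch(rules, url, mime_type):
    """Decide whether a URL with a given MIME type may be retrieved."""
    ext = os.path.splitext(url)[1].lower()
    mt = mime_type.lower()
    if mt in rules.get('blocktypes', []):
        return False
    if ext in rules.get('blockextension', []):
        return False
    allow_types = rules.get('allowtypes')
    if allow_types and mt not in allow_types:
        return False
    allow_ext = rules.get('allowextension')
    if allow_ext and ext not in allow_ext:
        return False
    return True

robots = """
# Do not index PDF files; allow only HTML-ish extensions
BlockTypes: application/pdf, application/msword
AllowExtension: .html, .php, .pl
"""
rules = parse_extended_records(robots)
print(may_fetch(rules, "/docs/report.pdf", "application/pdf"))  # False
print(may_fetch(rules, "/index.html", "text/html"))             # True
```

A real implementation would also have to scope these records to the
enclosing User-agent group, which this sketch ignores.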

/PT

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Fred Atkinson
Sent: 10 January 2004 17:52
To: Robots
Subject: [Robots] Robots.txt Evolution?

Hi,

    I've just subscribed to the robots.txt list.  I read the messages
posted through December.

    My question is if there is going to be any evolution to the
robots.txt coding?  It is very limited at present.  I can think of a few
things they could incorporate that would make it better, and I'm sure
the rest of you could, too.

    I've had robots.txt files on my sites for years.  When I recently
researched to see what had changed, it doesn't appear that there is
anything new on the horizon.

    I've got two robots completely blocked out of my system.  One is
Scooter, which is Alta Vista's robot.  When I initially put it in as
disallowed, it was because Scooter was hitting my site several times a
day.

    I don't mind them scanning me once in a while to get listings for
search engines, but I do object to them hammering my site that
frequently.

    Shouldn't there be coding to tell either a particular robot or
group of robots how often they are allowed to scan my site?  Maybe a
line like (and this is arbitrary):

User-agent: Scooter
Interval: 30d
Disallow: /whatever you want them not to scan when they do come in.

    This would tell Scooter that he is not to scan again until thirty
days after the last scan.  There could be codes like h[ours], d[ays],
w[eeks], m[onths], and y[ears].
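[Editor's note: the interval codes suggested above could be parsed
along these lines. The Interval directive itself is hypothetical, and
approximating a month as 30 days and a year as 365 is an assumption
made only for this sketch:]

```python
# Rough sketch of parsing the suggested Interval codes (h/d/w/m/y).
# Months and years are approximated as 30 and 365 days (assumption).
from datetime import timedelta

UNIT_DAYS = {'d': 1, 'w': 7, 'm': 30, 'y': 365}

def parse_interval(value):
    """Turn a code like '30d' or '12h' into a timedelta."""
    number, unit = int(value[:-1]), value[-1].lower()
    if unit == 'h':
        return timedelta(hours=number)
    if unit in UNIT_DAYS:
        return timedelta(days=number * UNIT_DAYS[unit])
    raise ValueError(f"unknown interval unit: {unit!r}")

print(parse_interval("30d"))   # 30 days, 0:00:00
print(parse_interval("12h"))   # 12:00:00
```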

    Just an idea I had.

    Feedback?


                                                            Fred

_______________________________________________
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots

