You brought up an interesting point about the lack of sites using
robots.txt, but I see two related problems:

1) Most sites do not deploy or properly maintain a robots.txt file.
2) Many robots (nascent, malicious, stealthy or otherwise) do not honor
the robots.txt file.

If there is to be a successor to robots.txt, it should be sophisticated
enough to improve both of these problems.  It's only in the robot
developer's best interest to honor the file if doing so mitigates
link funneling or avoids undesirable files.  It's only in the site
developer's best interest if they can control which portions of their
site are indexed or omitted.

Here's a proposal that I think helps solve both aspects of the problem.
By link funneling I'm referring to links that contain random session
identifiers, causing the same pages to be served up perpetually under
different anchor tags.  Most robots could benefit from a line that
identifies these types of URLs by their session identifier, for example:
session=jsessionid.  This simple enhancement would benefit both the
robot developers and the site developers.  The robots would no longer
need to identify these URLs by manual or automated profiling; they could
simply extract the session identifier from links that matched the mask.
The site developers would avoid the useless traffic that's presently
involved in inferring the random session identifiers.  Extending this
idea, perhaps the specification could allow robots to substitute their
own agent names for the session identifiers.  This would allow for a
loose type of referral tracking.  As a side effect, it would also cause
robots that spoof their agent names to implicate the competing robots
they impersonate.
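To make the idea concrete, an extended robots.txt along these lines
might look like the sketch below.  The "Session" directive and its
syntax are invented here purely for illustration; nothing like it
exists in the current standard.

```
User-agent: *
Disallow: /private/

# Hypothetical directive: names the query parameter that carries the
# random session identifier, so a robot can recognize, strip, or
# substitute it instead of profiling the URLs itself.
Session: jsessionid
```

A compliant robot would then treat /shop/item?jsessionid=A3F9 and
/shop/item?jsessionid=77B2 as the same page, rather than crawling both.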

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of Fred Atkinson
Sent: Sunday, January 11, 2004 10:45 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: Re: [Robots] Robots.txt Evolution?


    I'm inclined to agree that a second file would probably get
overlooked by bots.  I would imagine it was difficult trying to get
those who run them to respect the first one.

    I was unaware of the 'Allow' command.  Is there a URL that documents
it?

    Also, the use of wildcards when giving paths should be incorporated.
That would greatly reduce the number of path lines that you'd have to
type into the robots.txt file and that a robot would have to read out of
it.  And wildcards shouldn't be limited to just the end of the path; you
should be able to use them in the middle as well.  Perhaps the UNIX
brackets could be incorporated ( [0-9], [aeiou] ) as matching
characters.
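    A robot could evaluate such extended patterns quite cheaply.  The
sketch below uses Python's fnmatch module, which already understands
UNIX-style wildcards and bracket classes; the Disallow patterns shown
are hypothetical examples of the proposed syntax, not anything in the
current robots.txt standard.

```python
from fnmatch import fnmatch

# Hypothetical extended Disallow patterns: '*' anywhere in the path,
# plus UNIX bracket classes such as [0-9].  Illustrative only.
DISALLOW_PATTERNS = [
    "/cgi-bin/*",               # wildcard at the end (the classic prefix case)
    "/*/print/*",               # wildcard in the middle of the path
    "/archive/page[0-9].html",  # bracket class matching a single digit
]

def is_disallowed(path: str) -> bool:
    """Return True if the URL path matches any extended Disallow pattern."""
    return any(fnmatch(path, pattern) for pattern in DISALLOW_PATTERNS)

print(is_disallowed("/cgi-bin/search"))      # True
print(is_disallowed("/news/print/story42"))  # True
print(is_disallowed("/archive/page7.html"))  # True
print(is_disallowed("/archive/pages.html"))  # False
```

Three patterns here replace what would otherwise be dozens of literal
prefix lines, which is exactly the reduction in path lines described
above.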

    As far as the number of sites that actually use robots.txt, that
would grow as the strength of the robots.txt coding improved.


                                                        Fred

----- Original Message ----- 
From: "Paul Trippett" <[EMAIL PROTECTED]>
To: "'Internet robots, spiders, web-walkers, etc.'"
<[EMAIL PROTECTED]>
Sent: Sunday, January 11, 2004 8:57 AM
Subject: RE: [Robots] Robots.txt Evolution?


>
> > This would need to be a separate file, probably "robots2.txt".
>
> People who are oblivious to the robots.txt standard already complain 
> about grabbing robots.txt, and I would have thought that getting another 
> file would make them complain a little more.  Besides, who wants to 
> maintain two files that do the same job?  In my view, if it was in 
> another file an extended standard would never be used, and then yes, 
> there would be no point.
>
> /pt
>
>
> _______________________________________________
> Robots mailing list
> [EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots
