It was thus said that the Great Matthew Meadows once stated:
> 
> By link funneling I'm referring to links that contain random session
> identifiers, causing the same pages to be served up perpetually with
> different anchor tags.  Most robots could benefit from a line that
> identified these types of urls by their session identifier, for example:
> session=jsessionid.  This simple enhancement would benefit both the
> robot developers and the site developers.  The robots would no longer
> need to identify these urls by manual or automated profiling, they could
> simply extract the session identifier from links that matched the mask.
> The site developers would prevent the useless traffic that's presently
> involved in inferring the random session identifiers.  Extending this
> idea, perhaps the specification could allow robots to substitute their
> own agent names for the session identifiers.  This would allow for a
> loose type of referral tracking.  As a side effect it would also cause
> robots that spoof their agent names to implicate competing robots.

  I'm beginning to think that to really extend the robots exclusion
protocol, two new directives need to be defined, Ignore: and Index:, that
work similarly to Disallow: and Allow: but allow regular expressions to
be used.  An up-to-date robot could internally convert:

        Disallow: /sooperceecret/

to the equivalent:

        Ignore: ^/sooperceecret/.*

So, to avoid session ids, one could just do:

        Ignore: .*sessionid=.*

And there you go.  Using different directives makes this backwards
compatible with the original robots.txt (an older robot will simply
ignore the new directives) and avoids overloading the meaning of existing
directives (one of the drawbacks of my own proposed extension).
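A robot supporting these hypothetical directives could normalize both rule
types into one regex list.  A minimal sketch (the Ignore:/Index: directives
and the conversion rule are the proposal above, not anything standardized):

```python
import re

def disallow_to_ignore(prefix):
    """Convert a classic Disallow: prefix rule into the equivalent
    hypothetical Ignore: regular expression."""
    return "^" + re.escape(prefix) + ".*"

def is_ignored(url_path, patterns):
    """True if the URL path matches any Ignore: pattern."""
    return any(re.search(p, url_path) for p in patterns)

patterns = [
    disallow_to_ignore("/sooperceecret/"),  # becomes ^/sooperceecret/.*
    r".*sessionid=.*",                      # skip session-id links
]

is_ignored("/sooperceecret/page.html", patterns)       # ignored
is_ignored("/index.html?sessionid=abc123", patterns)   # ignored
is_ignored("/public/page.html", patterns)              # crawled
```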

  So for my hypothetical robots.txt site I mentioned in a previous post, I
could do:

        User-agent: *
        Index: ^/$
        Disallow: /

  or even:

        User-agent: *
        Index: ^/$
        Ignore: .*
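To see that the second record really does restrict a robot to the root
page alone, here is a sketch of how it might be evaluated.  I'm assuming
first-match-wins semantics for the rule list, which the proposal above
doesn't pin down:

```python
import re

def allowed(url_path, rules):
    """Evaluate hypothetical Index:/Ignore: rules in file order; the
    first matching rule decides (assumed first-match-wins semantics)."""
    for directive, pattern in rules:
        if re.search(pattern, url_path):
            return directive == "Index"
    return True  # no rule matched: crawl by default

# The second hypothetical robots.txt record from above:
rules = [("Index", r"^/$"), ("Ignore", r".*")]

allowed("/", rules)                # True: only the root page is indexed
allowed("/private/a.html", rules)  # False: swallowed by Ignore: .*
```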

  -spc (Likes it ... )

_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots