Sean 'Captain Napalm' Conner wrote:

> It was thus said that the Great [EMAIL PROTECTED] once stated:
> >
> > It would be great to know how to ask that http service for the list of
> > "default or index file" names so the agents could verify what file name
> > was indeed associated with the "/" slash. We could then put the file name
> > on the URL to completely qualify that URL path. Anyone?
>
> No. In some pathological cases there *isn't* a file associated with what
> looks like a directory.
Pathological? They're as normal as any other web server, I assure you. But your point is well taken, and allow me to underscore it.

All that stuff that appears in an HTTP URL after the host name and (optional) port number is just a bunch of stuff the browser can send to the host to get more stuff back. Sure, we call it a URL path and it looks like a file path, but (this is where it gets zenny) there is no file path.

Don't be misled by relative URLs. Yes, they use "." and "..". Yes, "/" is very important. Yes, they operate almost identically to UNIX relative paths (but differently enough to keep us on our toes). Yes, they are extremely useful. But they're just rules that take the stuff you used to get the current page and some relative stuff to construct new stuff -- all done by the browser. The web server only ever sees pure, unadulterated, unrelative stuff.

Mind you, I do agree that we get led down the garden path when we're told that "http://somesite.com" needs to be written as "http://somesite.com/" to be proper, because otherwise you have no stuff to send to somesite.com:80. That didn't have to be a problem: "GET\r\n" should work just as well as "GET /\r\n", no? But that slash is just too darn handy for separating the stuff from the host or port number. Either way, we're now primed into thinking that every trailing slash has (or does not have) meaning.

I don't want to get too W3C about this. The opacity of the URL path is a great design point, but I do recognize that pragmatically it is almost always a file path, and optimizations abound. I don't mind peeking behind an interface as long as I know exactly which rules I'm breaking.

The issue of changing robots.txt is a quagmire unto itself. Let's avoid that for a moment by recognizing that any robots.txt feature can simply be part of a robot's ordinary configuration and judged purely by its utility. A spider I used to work on did exactly this by accident.
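The relative-URL point is easy to demonstrate: resolution happens entirely in the client, by string rules, before anything is sent to the server. A small sketch using Python's urllib.parse.urljoin, which follows the same resolution rules a browser does (the base URL here is made up):

```python
# Relative URLs are resolved client-side: the rules just combine "the stuff
# you used to get the current page" with the relative stuff.  The server
# never sees "." or "..".
from urllib.parse import urljoin

base = "http://somesite.com/a/b/page.html"  # illustrative base URL

print(urljoin(base, "other.html"))  # http://somesite.com/a/b/other.html
print(urljoin(base, "../up.html"))  # http://somesite.com/a/up.html
print(urljoin(base, "/top.html"))   # http://somesite.com/top.html
```

Only the final, absolute form ever goes out on the wire.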
You could specify the path for robots.txt on a site (default "/robots.txt"). Want to ignore it? Then leave the box blank. Much later, someone clever realized that, because we used general URL concatenation routines, a full URL could be input as the robots.txt path. Rather embarrassing to realize that we need not have built our own path inclusion/exclusion mechanisms. All we needed to do was allow multiple robots.txt URLs (relative or otherwise) per site.

My suggestion is that the robot construct URLs with care -- always do what a browser would do, and respect the fact that the HTTP server may need back exactly the same stuff it put into the HTML. And always, always store exactly the URL used to retrieve a block of content. But implement some generic mechanism to generalize URL equality beyond strcmp(). Regular expression search and replace looks as promising as anything. Imagine something like this (with a perlish regexp):

    URL-same: s'/(index|default)\.html?$'/'

In other words, if the URL ends in "/index.html", "/default.html", "/index.htm" or "/default.htm", then drop all but the slash and we'll assume the URL boils down to the same content.

Here's a starter set:

    URL-same: s'^http://([^/]+)'http://\L$1'    # lower-case host name
    URL-same: s'http://([^/]+)/$'http://$1'     # trailing-slash elimination for "root"
    URL-same: s':80/'/'                         # don't need to see the default port number

Here are some if you're dealing with most file servers:

    URL-same: s'[^/]+/\.\.(/|$)''               # condense ".."
    URL-same: tr'A-Z'a-z'                       # case-fold the whole thing 'cause why not?
    URL-same: s'%7[Ee]'~'g                      # one of the more annoying escapes

And something for the "pathological sites":

    URL-same: s'^(http://boston.conman.org/.*/)0+'$1'g
    URL-same: s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g

For the truly cavalier:

    URL-same: s'^http://www\.'http://'

Well, OK, these regexps are all off the top of my head and I know some of them have problems, but I hope you get the gist of it.
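The URL-same idea boils down to an ordered list of substitutions applied to both URLs before comparing them. Here's a rough Python translation of the mechanism -- not the spider's actual code, and the rule set below is a small illustrative subset, not a vetted production list:

```python
# A minimal sketch of "URL-same": normalize a URL through an ordered list of
# search-and-replace rules, then compare normalized forms instead of raw
# strings.  Rules and example URLs are illustrative assumptions.
import re

URL_SAME = [
    # lower-case the host name (a callable replacement stands in for \L)
    (r'^http://([^/]+)', lambda m: 'http://' + m.group(1).lower()),
    # drop the default port number
    (r'^(http://[^/]+):80(/|$)', r'\1\2'),
    # default-document names boil down to the bare directory
    (r'/(index|default)\.html?$', '/'),
    # a bare host is the same as host plus "/"
    (r'^(http://[^/]+)$', r'\1/'),
]

def normalize(url):
    """Apply each URL-same rule in order; compare results, not raw URLs."""
    for pattern, repl in URL_SAME:
        url = re.sub(pattern, repl, url)
    return url

def url_same(a, b):
    """Equality beyond strcmp(): same after normalization."""
    return normalize(a) == normalize(b)

print(normalize('http://SomeSite.COM:80/index.html'))  # http://somesite.com/
print(url_same('http://somesite.com', 'http://somesite.com/'))  # True
```

Order matters: rules run top to bottom, so lower-casing the host first lets later host-anchored patterns stay simple.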
It would be so cool if a robot could discover these patterns for itself. Seems like it would be a small-scale version of covering boston.conman.org's other "problem" of multiple overlapping data views.

--
George

--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]). For list server commands, send "help" in the body of a
message to "[EMAIL PROTECTED]".
