Sean 'Captain Napalm' Conner wrote:

> It was thus said that the Great [EMAIL PROTECTED] once stated:
> >
> > It would be great to know how to ask that http service for the list of
> > "default or index file" names so the agents could verify what file name
> > was indeed associated with the "/" slash. We could then put the file name
> > on the URL to completely qualify that URL path. Anyone?
>
> No. In some pathological cases there *isn't* a file associated with what
> looks like a directory.
Pathological? They're as normal as any other web server, I assure you. But your point is well taken, and allow me to underscore it.

All that stuff that appears in an HTTP URL after the host name and (optional) port number is just a bunch of stuff the browser can send to the host to get more stuff back. Sure, we call it a URL path and it looks like a file path, but (this is where it gets zenny) there is no file path.

Don't be misled by relative URLs. Yes, they use "." and "..". Yes, "/" is very important. Yes, they operate almost identically to UNIX relative paths (but differently enough to keep us on our toes). Yes, they are extremely useful. But they're just rules that take the stuff you used to get the current page and some relative stuff to construct new stuff -- all done by the browser. The web server only ever sees pure, unadulterated, unrelative stuff.

Mind you, I do agree that we get led down the garden path when we're told that "http://somesite.com" needs to be written as "http://somesite.com/" to be proper, because otherwise you have no stuff to send to somesite.com:80. That didn't have to be a problem: "GET\r\n" should work just as well as "GET /\r\n", no? But that slash is just too darn handy for separating the stuff from the host or port number. Either way, we're now primed into thinking that every trailing slash has (or does not have) meaning.

I don't want to get too W3C about this. The opacity of the URL path is a great design point, but I do recognize that pragmatically it is almost always a file path, and optimizations abound. I don't mind peeking behind an interface as long as I know exactly which rules I'm breaking.

The issue of changing robots.txt is a quagmire unto itself. Let's avoid that for a moment by recognizing that any robots.txt feature can simply be part of a robot's ordinary configuration and judged purely by its utility. A spider I used to work on did exactly this by accident.
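The relative-URL point is easy to demonstrate: resolution happens entirely in the client, by string rules, before anything is sent to the server. A small sketch using Python's urllib.parse.urljoin, which follows the same resolution rules a browser does (the base URL here is made up):

```python
# Relative URLs are resolved client-side: the rules just combine "the stuff
# you used to get the current page" with the relative stuff.  The server
# never sees "." or "..".
from urllib.parse import urljoin

base = "http://somesite.com/a/b/page.html"  # illustrative base URL

print(urljoin(base, "other.html"))  # http://somesite.com/a/b/other.html
print(urljoin(base, "../up.html"))  # http://somesite.com/a/up.html
print(urljoin(base, "/top.html"))   # http://somesite.com/top.html
```

Only the final, absolute form ever goes out on the wire.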
You could specify the path for robots.txt on a site (default "/robots.txt"). Want to ignore it? Then leave the box blank. Much later, someone clever realized that, because we used general URL concatenation routines, a full URL could be input as the robots.txt path. Rather embarrassing to realize that we need not have built our own path inclusion/exclusion mechanisms. All we needed to do was allow multiple robots.txt URLs (relative or otherwise) per site.

My suggestion is that the robot construct URLs with care -- always do what a browser would do, and respect the fact that the HTTP server may need back exactly the same stuff it put into the HTML. And always, always store exactly the URL used to retrieve a block of content. But implement some generic mechanism to generalize URL equality beyond strcmp(). Regular expression search and replace looks as promising as anything. Imagine something like this (with a perlish regexp):

    URL-same: s'/(index|default)\.html?$'/'

In other words, if the URL ends in "/index.html", "/default.html", "/index.htm" or "/default.htm", then drop all but the slash and we'll assume the URL boils down to the same content.

Here's a starter set:

    URL-same: s'^http://([^/]+)'http://\L$1'    # lower-case host name
    URL-same: s'http://([^/]+)/$'http://$1'     # trailing-slash elimination for "root"
    URL-same: s':80/'/'                         # don't need to see the default port number

Here are some if you're dealing with most file servers:

    URL-same: s'[^/]+/\.\.(/|$)''               # condense ".."
    URL-same: tr'A-Z'a-z'                       # case-fold the whole thing 'cause why not?
    URL-same: s'%7[Ee]'~'g                      # one of the more annoying escapes

And something for the "pathological sites":

    URL-same: s'^(http://boston.conman.org/.*/)0+'$1'g
    URL-same: s'^(http://boston.conman.org/.*\.[0-9]*)0+(/|$)'$1$2'g

For the truly cavalier:

    URL-same: s'^http://www\.'http://'

Well, OK, these regexps are all off the top of my head and I know some of them have problems, but I hope you get the gist of it.
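The URL-same idea boils down to an ordered list of substitutions applied to both URLs before comparing them. Here's a rough Python translation of the mechanism -- not the spider's actual code, and the rule set below is a small illustrative subset, not a vetted production list:

```python
# A minimal sketch of "URL-same": normalize a URL through an ordered list of
# search-and-replace rules, then compare normalized forms instead of raw
# strings.  Rules and example URLs are illustrative assumptions.
import re

URL_SAME = [
    # lower-case the host name (a callable replacement stands in for \L)
    (r'^http://([^/]+)', lambda m: 'http://' + m.group(1).lower()),
    # drop the default port number
    (r'^(http://[^/]+):80(/|$)', r'\1\2'),
    # default-document names boil down to the bare directory
    (r'/(index|default)\.html?$', '/'),
    # a bare host is the same as host plus "/"
    (r'^(http://[^/]+)$', r'\1/'),
]

def normalize(url):
    """Apply each URL-same rule in order; compare results, not raw URLs."""
    for pattern, repl in URL_SAME:
        url = re.sub(pattern, repl, url)
    return url

def url_same(a, b):
    """Equality beyond strcmp(): same after normalization."""
    return normalize(a) == normalize(b)

print(normalize('http://SomeSite.COM:80/index.html'))  # http://somesite.com/
print(url_same('http://somesite.com', 'http://somesite.com/'))  # True
```

Order matters: rules run top to bottom, so lower-casing the host first lets later host-anchored patterns stay simple.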
It would be so cool if a robot could discover these patterns for itself. Seems like it would be a small-scale version of covering boston.conman.org's other "problem" of multiple overlapping data views.

--
George

--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]). For list server commands, send "help" in the body of a
message to "[EMAIL PROTECTED]".
