On Thu, Nov 28, 2002 at 12:16:19PM -0700, Keary Suska wrote:
> on 11/27/02 7:54, [EMAIL PROTECTED] purportedly said:
> 
> > Thank you but this won't help me I guess.
> > 
> > I could find that info only from within the script, right?
> > 
> > Well, I want to create a program like Teleport Pro for Windows, which
> > spiders a web site and downloads all the pages from the site.
> > Downloading the pages is very easy, but the biggest problem is creating the
> > local file names and replacing all the links in the downloaded pages to
> > make them work locally.
> > 
> > So far, the only problem I have found is that I can't reliably find the file
> > name from the path in all cases.

I have written a couple of programs that do this.
You don't really need to know a file name, but you do need to weed out
duplicates, e.g. [...]/foo/ is often identical to [...]/foo/index.html.
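
A minimal sketch of that duplicate check, in Python. The list of index
names and the helper names canonical_url and unseen are my own
assumptions, not something from the programs mentioned above. Since
[...]/foo/ is only often (not always) the same page as
[...]/foo/index.html, a careful spider would compare the page contents
as well.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the default document names used by the server being spidered.
INDEX_NAMES = ("index.html", "index.htm")


def canonical_url(url):
    """Return a key under which probable duplicates collapse together."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Treat .../foo/index.html as .../foo/
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Host names are case-insensitive, and the fragment never reaches the
    # server, so lower-case the former and drop the latter.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def unseen(urls):
    """Yield each URL whose canonical form has not been seen before."""
    seen = set()
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            yield url
```

Feeding every extracted link through unseen() before queueing it keeps
the spider from fetching the same page twice under two names.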
 
> Well, yes and no. The example URL provided:
> 
> > http://www.site.com/script.cfm/dir1/dir2/http://www.site.com/file.html
> 
> is technically a malformed URI.

According to RFC 2396, it isn't.  A : is allowed anywhere in the path,
and a // is allowed to appear multiple times as well, as far as I can see.
(The / characters separate "segments", and segments can be empty.)
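
To illustrate the segment reading, here is a small sketch using Python's
standard urllib.parse on the example URL. It only shows how one common
parser splits the URL; it is not a proof of RFC conformance.

```python
from urllib.parse import urlsplit

url = "http://www.site.com/script.cfm/dir1/dir2/http://www.site.com/file.html"
parts = urlsplit(url)

# Only the first "//" introduces the host; everything after the next "/"
# is path, including the embedded "http://".
host = parts.netloc
path = parts.path

# Split the path into segments, dropping the empty string that precedes
# the leading "/". The embedded "//" shows up as one empty segment.
segments = path.split("/")[1:]

print(host)
print(segments)
```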

Some browsers (at least IE 6 and links) misparse such URLs,
but they have no excuse, as far as I can see.

> It should be:
> 
> http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com%2Ffile.html
> 
> or minimally:
> 
> http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com/file.html

That would mean you'd have to rewrite the URLs that point to them
from other documents so they won't break.
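
A rough sketch of what that rewriting would involve, in Python. The
function names encode_embedded and rewrite_links are hypothetical, and
the plain substring replacement is a simplifying assumption; a real
spider would rewrite href attributes with an HTML parser.

```python
from urllib.parse import quote


def encode_embedded(prefix, embedded):
    """Append an embedded URL to prefix, escaping every '/' in it as %2F.

    The ':' is kept literal, matching the first escaped form quoted above.
    """
    return prefix + quote(embedded, safe=":")


def rewrite_links(html, old, new):
    """Replace every occurrence of the old URL in a document with the new one."""
    return html.replace(old, new)


prefix = "http://www.site.com/script.cfm/dir1/dir2/"
old = prefix + "http://www.site.com/file.html"
new = encode_embedded(prefix, "http://www.site.com/file.html")

page = '<a href="' + old + '">file</a>'
print(rewrite_links(page, old, new))
```

Every document that links to the old form has to go through the same
replacement, which is the cost I mean.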

> You will always find that sites do stupid things, and will have to find ways
> around them. However, in the case of extra PATH_INFO or query strings, it
> doesn't hurt to treat them as they are, and you will be successful most of
> the time.
> 
> Other than issues with the URI above, you should have minimal problems.

Right now I have the problem that Apache 2 won't feed URLs to
script.php (in my case it's a PHP script) if they have an extra path.
But this is just one of my regular quarrels with the Apache
configuration file mess; I expect it can be done somehow.

-- 
Reinier Post
TU Eindhoven
