On Thu, Nov 28, 2002 at 12:16:19PM -0700, Keary Suska wrote:
> on 11/27/02 7:54, [EMAIL PROTECTED] purportedly said:
>
> > RE: Help! how is this called?
> > Thank you but this won't help me I guess.
> >
> > I could find that info only from within the script, right?
> >
> > Well, I want to create a program like that Teleport Pro from Windows that
> > spiders a web site and download all the pages from the site.
> > To download the pages is very easy, but the biggest problem is to create the
> > local file names, and to replace all the links from the downloaded pages to
> > make them work locally.
> >
> > Until now, the only problem I found, is that I can't reliably find the file
> > name from the path in all the cases.

I have written a couple of programs that do this.
You don't really need to know a file name, but you do need to weed out
duplicates, e.g. [...]/foo/ is often identical to [...]/foo/index.html.

> Well, yes and no. The example URL provided:
>
> > http://www.site.com/script.cfm/dir1/dir2/http://www.site.com/file.html
>
> is technically a malformed URI.

According to RFC 2396, it isn't.  A : is allowed anywhere in the path,
and a // is allowed to appear multiple times as well, as far as I can see.
(The / characters separate "segments", and segments can be empty.)
Some browsers (at least IE 6 and Links) misparse such URLs, but I see
no excuse for that.

> It should be:
>
> http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com%2Ffile.html
>
> or minimally:
>
> http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com/file.html

That would mean you'd have to rewrite the URLs that point to them
from other documents so they won't break.

> You will always find that sites do stupid things, and will have to find ways
> around them. However, the case of extra PATH_INFO or query strings, it
> doesn't hurt to treat them as they are, and you will be successful most of
> the time.
>
> Other than issues with the URI above, you should have minimal problems.

Right now I have the problem that Apache 2 won't feed URLs to script.php
(in my case it's a PHP script) if they have an extra path.  But this is
just one of my regular quarrels with the Apache configuration file mess;
I expect it can be done somehow.

-- 
Reinier Post
TU Eindhoven
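
P.S. A minimal sketch of the two points above, assuming Python and its
standard urllib.parse module. It is only an illustration, not the code of
the programs mentioned earlier. The names path_segments, dedup_key and
INDEX_NAMES are made up here, and treating index.html as the directory
index is a guess that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: file names that a server is likely to serve for "/foo/".
# Real sites may use other names, or serve different content for both.
INDEX_NAMES = ("index.html", "index.htm")


def path_segments(url):
    """Return the path of `url` split on "/" (empty segments are kept)."""
    return urlsplit(url).path.split("/")


def dedup_key(url):
    """Return a key under which probable duplicates of `url` collide.

    Heuristic only: strips a trailing index file name and the fragment,
    and lowercases scheme and host (userinfo in the netloc is ignored).
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last in INDEX_NAMES:
        path = head + "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


if __name__ == "__main__":
    example = ("http://www.site.com/script.cfm/dir1/dir2/"
               "http://www.site.com/file.html")
    # The ":" and the second "//" stay inside the path; the "//" shows up
    # as an empty segment between "http:" and "www.site.com".
    print(path_segments(example))

    seen = set()
    for url in ("http://www.site.com/foo/",
                "http://www.site.com/foo/index.html"):
        key = dedup_key(url)
        print(url, "duplicate" if key in seen else "new")
        seen.add(key)
```

The key is only used to decide whether a page was fetched before; the
links written into the downloaded pages still have to be rewritten
separately.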
