Yes, I found that there are a lot of bad links. I think there are more bad links than good ones.
Yes, you're right, that URL was not a real one. The PATH_INFO doesn't always have the part with http:// in it; sometimes it has some directories and a simple local file at the end. I could save the page under any name, but I want to know the real file name, because I will follow the links from that page and that path will be their base URL. If the links from that page are relative links like "dir/file.html" and I can't find the exact depth of the directories, I won't be able to browse the pages locally without errors.

I found a workaround for this problem. It might not always work, but... I parse the URL, and the first path segment that has a "." in it is considered to be a file. I am assuming that the directories don't use a "." but ... (a rough sketch of this idea follows after the quoted message below).

Teddy,
Teddy's Center: http://teddy.fcc.ro/
Email: [EMAIL PROTECTED]

----- Original Message -----
From: "Keary Suska" <[EMAIL PROTECTED]>
To: "Libwww Perl" <[EMAIL PROTECTED]>
Sent: Thursday, November 28, 2002 9:16 PM
Subject: Re: Help! how is this called?

on 11/27/02 7:54, [EMAIL PROTECTED] purportedly said:

> RE: Help! how is this called? Thank you, but this won't help me, I guess.
>
> I could find that info only from within the script, right?
>
> Well, I want to create a program like Teleport Pro from Windows that
> spiders a web site and downloads all the pages from the site.
> Downloading the pages is very easy, but the biggest problem is creating
> the local file names and replacing all the links in the downloaded pages
> so that they work locally.
>
> Until now, the only problem I have found is that I can't reliably find
> the file name from the path in all cases.

Well, yes and no. The example URL provided:

http://www.site.com/script.cfm/dir1/dir2/http://www.site.com/file.html

is technically a malformed URI. It should be:

http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com%2Ffile.html

or minimally:

http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com/file.html

You will always find that sites do stupid things, and you will have to find ways around them. However, in the case of extra PATH_INFO or query strings, it doesn't hurt to treat them as they are, and you will be successful most of the time. Other than the issues with the URI above, you should have minimal problems.

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"
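A minimal sketch of the file-name heuristic described above, assuming Perl with the URI module (guess_file_and_base is an illustrative name, not an existing LWP/URI call):

    use strict;
    use warnings;
    use URI;

    # Heuristic: the first path segment containing a "." is treated as the
    # file; everything before it is treated as the directory part.
    sub guess_file_and_base {
        my ($url) = @_;
        my $uri = URI->new($url);

        # path_segments() splits the path on "/"; the first element is
        # empty because the path starts with "/".
        my @segments = $uri->path_segments;
        shift @segments if @segments && $segments[0] eq '';

        my (@dirs, $file);
        for my $seg (@segments) {
            if (!defined $file && $seg =~ /\./) {
                $file = $seg;        # first segment with a "." => the file
            }
            elsif (!defined $file) {
                push @dirs, $seg;    # segments before it => directories
            }
            # segments after the guessed file (extra PATH_INFO) are ignored
        }

        $file = 'index.html' unless defined $file;   # assumed fallback name
        my $base = @dirs ? '/' . join('/', @dirs) . '/' : '/';
        return ($file, $base);
    }

    my ($file, $base) = guess_file_and_base(
        'http://www.site.com/script.cfm/dir1/dir2/page.html');
    print "file: $file\nbase: $base\n";   # file: script.cfm, base: /

Relative links extracted from the downloaded page could then be resolved against the original URL with URI->new_abs() before being mapped onto the guessed local directory.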

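For the malformed-URI point in the quoted message, here is a small sketch of how the embedded URL could be percent-escaped with URI::Escape so that it becomes a single path segment (the URLs are the examples from the message):

    use strict;
    use warnings;
    use URI::Escape qw(uri_escape);

    # The inner URL that ended up inside PATH_INFO:
    my $inner = 'http://www.site.com/file.html';

    # Escaping only the "/" characters produces the well-formed variant
    # shown in the quoted message:
    my $escaped = uri_escape($inner, '/');
    # $escaped is now "http:%2F%2Fwww.site.com%2Ffile.html"

    my $outer = "http://www.site.com/script.cfm/dir1/dir2/$escaped";
    print "$outer\n";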