Yes, I found that there are a lot of bad links. I think there are more of
them than good links.

Yes, you're right, that URL was not a real one. The PATH_INFO doesn't always
contain the part with http://.
Sometimes it has some directories and a simple local file at the end.

I can save it under any name, but I want to know the real file name because I
will follow the links from that page, and that path will be their base URL.
If the links on that page are relative, like "dir/file.html", and
I can't determine the exact directory depth, I won't be able to
browse the pages locally without errors.
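
To illustrate why the depth matters, here is a minimal sketch in Python (not Perl, and not the code I actually use). The URLs are hypothetical; the only difference between the two bases is whether "page.html" is treated as a file or as a directory level.

```python
from urllib.parse import urljoin

# Hypothetical relative link found in a downloaded page.
link = "dir/file.html"

# Two guesses about the same page: the last segment is a file,
# or it is one more directory level (trailing slash).
base_if_file = "http://www.site.com/script.cfm/dir1/dir2/page.html"
base_if_dir = "http://www.site.com/script.cfm/dir1/dir2/page.html/"

# Standard relative resolution replaces the last segment of the base.
resolved_as_file = urljoin(base_if_file, link)
resolved_as_dir = urljoin(base_if_dir, link)

print(resolved_as_file)
print(resolved_as_dir)
```

The two results differ by one directory level, so a wrong guess about the file name sends every relative link to the wrong place.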

I found a workaround for this problem.
It might not always work, but...

I am parsing the URL, and the first path segment that has a "." in it is
considered to be the file.

I am assuming that directory names don't contain a ".", but ...
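
A rough sketch of the workaround above, written in Python only for illustration. The function name split_script_path is made up, and the heuristic is exactly as fragile as described: a directory name containing a "." will be mistaken for the file.

```python
from urllib.parse import urlsplit

def split_script_path(url):
    """Split a URL path at the first segment containing a dot.

    Returns (file_path, extra_path). This is only the heuristic
    described above, not a general rule about URLs.
    """
    segments = urlsplit(url).path.split("/")
    for i, seg in enumerate(segments):
        if "." in seg:
            # Everything up to and including this segment is the "file";
            # whatever follows is treated as extra PATH_INFO.
            return "/".join(segments[: i + 1]), "/".join(segments[i + 1 :])
    # No segment with a dot: treat the whole path as a directory.
    return "/".join(segments), ""

# Works when only the script name contains a dot.
good = split_script_path("http://www.site.com/script.cfm/dir1/dir2/file.html")

# Fails when a directory name contains a dot (hypothetical URL).
bad = split_script_path("http://www.site.com/v1.2/index.html")

print(good)
print(bad)
```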

Teddy,
Teddy's Center: http://teddy.fcc.ro/
Email: [EMAIL PROTECTED]

----- Original Message -----
From: "Keary Suska" <[EMAIL PROTECTED]>
To: "Libwww Perl" <[EMAIL PROTECTED]>
Sent: Thursday, November 28, 2002 9:16 PM
Subject: Re: Help! how is this called?


on 11/27/02 7:54, [EMAIL PROTECTED] purportedly said:

> Thank you but this won't help me I guess.
>
> I could find that info only from within the script, right?
>
> Well, I want to create a program like that Teleport Pro from Windows that
> spiders a web site and download all the pages from the site.
> To download the pages is very easy, but the biggest problem is to create the
> local file names, and to replace all the links from the downloaded pages to
> make them work locally.
>
> Until now, the only problem I found, is that I can't reliably find the file
> name from the path in all the cases.

Well, yes and no. The example URL provided:

> http://www.site.com/script.cfm/dir1/dir2/http://www.site.com/file.html

is technically a malformed URI. It should be:

http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com%2Ffile.html

or minimally:

http://www.site.com/script.cfm/dir1/dir2/http:%2F%2Fwww.site.com/file.html
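
As a small illustration of the first, fully escaped form, here is a sketch in Python (chosen only for brevity, not a libwww-perl call). The variable names are made up, and keeping ":" unescaped is a choice made here to match the example above.

```python
from urllib.parse import quote, unquote

# The script URL and the URL passed as extra path information.
prefix = "http://www.site.com/script.cfm/dir1/dir2/"
embedded = "http://www.site.com/file.html"

# Escape every "/" in the embedded URL; safe=":" leaves the colon as-is.
encoded = quote(embedded, safe=":")
full = prefix + encoded

print(full)

# Decoding the escaped part gives back the embedded URL unchanged.
print(unquote(encoded))
```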

You will always find that sites do stupid things, and you will have to find
ways around them. However, in the case of extra PATH_INFO or query strings, it
doesn't hurt to treat them as they are, and you will be successful most of
the time.

Other than issues with the URI above, you should have minimal problems.

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"
