OK, I need to explain a little better so you can understand.

I am on dial-up, paying per minute.
I want to make a script that downloads a list of web sites on another
computer which is connected to the internet all the time, creates a tar.gz
archive of those files and directories, and after that I can download
that archive in a very short time.

So I will read those web sites offline, with no connection to the internet.
I need to replace every absolute and relative link in all the pages with a
relative version that works locally.
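As an illustration only (this is not the poster's script, and the helper names are hypothetical), the link-rewriting step described above could be sketched like this in Python: resolve each link against the URL of the page it appears on, map both to paths inside the mirror directory, and emit a relative link between the two:

```python
import posixpath
from urllib.parse import urljoin, urlparse

def local_path(url):
    """Map a URL to a file path inside the local mirror directory."""
    p = urlparse(url)
    path = p.path.lstrip("/") or "index.html"
    # Assumption: a trailing slash means a directory, so store its
    # index page there -- the file-vs-directory guessing problem
    # discussed in this thread starts exactly here.
    if path.endswith("/"):
        path += "index.html"
    return posixpath.join(p.netloc, path)

def rewrite_link(page_url, href):
    """Rewrite one link from a page so it works in the offline mirror."""
    absolute = urljoin(page_url, href)           # absolute form of the link
    target = local_path(absolute)                # where it will live on disk
    base_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(target, base_dir)   # relative link for local use
```

For example, `rewrite_link("http://teddy.fcc.ro/docs/index.html", "/img/logo.gif")` yields `../img/logo.gif`, which works when browsing the mirrored tree from disk.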

The conclusion so far is that I can't make it work in every case, because the
web server knows whether a path is a file or a directory, but it doesn't tell me.
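The ambiguity matters because whether the last path segment is a file or a directory changes the base against which relative links resolve. A quick check with standard relative-URL resolution (hypothetical URLs, shown here with Python's `urllib.parse.urljoin`):

```python
from urllib.parse import urljoin

# If "/a" names a file, "b.html" is a sibling of it;
# if "/a/" names a directory, "b.html" lives inside it.
print(urljoin("http://site.com/a", "b.html"))   # -> http://site.com/b.html
print(urljoin("http://site.com/a/", "b.html"))  # -> http://site.com/a/b.html
```

Nothing in the downloaded document itself says which of the two cases applies; only the server knows.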


Teddy,
Teddy's Center: http://teddy.fcc.ro/
Email: [EMAIL PROTECTED]

----- Original Message -----
From: "Keary Suska" <[EMAIL PROTECTED]>
To: "Libwww Perl" <[EMAIL PROTECTED]>
Sent: Saturday, November 30, 2002 11:03 PM
Subject: Re: Help! how is this called?


on 11/30/02 11:02, [EMAIL PROTECTED] purportedly said:

> Yes, I found that there are a lot of bad links. I think there are more bad
> links than good ones.
>
> Yes, you're right, that URL was not a real one. PATH_INFO doesn't always
> contain a part with http:// in it.
> Sometimes it has some directories and a plain local file at the end.
>
> I can save it with any name, but I want to know the real file name, because
> I will follow the links from that page, and that path will be their base
> URL. If the links from that page are relative links like "dir/file.html"
> and I can't work out the exact directory depth, I won't be able to
> browse the pages locally without errors.

There should not be any errors, unless the site is broken, in which case you
can't follow the link anyway. Remember that *no* web client, including every
web browser available today (at least none that I have ever worked with),
understands PATH_INFO. If you are getting errors, it is likely because of
your code, and not the site. Take the following example:

    http://www.site.com/script.cgi/dir1/file.html

Say there is a link on the returned page to "file2.html". Its URL will be:

    http://www.site.com/script.cgi/dir1/file2.html

That is how it should be, and the site should respond without error. If it
responds with an error, it will do so for any web browser as well, provided
your code is correct.

If the returned page has the link "/images/image.gif", its URL will be:

    http://www.site.com/images/image.gif

You can test this yourself by creating your own script and seeing how your
browser behaves.
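Aside from testing with a browser, the two resolutions above are just standard relative-URL resolution, so they can also be checked directly (a quick demonstration, not part of the original exchange; shown with Python's `urljoin`):

```python
from urllib.parse import urljoin

base = "http://www.site.com/script.cgi/dir1/file.html"

# A bare relative link resolves inside the PATH_INFO "directory":
print(urljoin(base, "file2.html"))
# -> http://www.site.com/script.cgi/dir1/file2.html

# A root-relative link ignores PATH_INFO entirely:
print(urljoin(base, "/images/image.gif"))
# -> http://www.site.com/images/image.gif
```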

> I am parsing the URL, and any path segment that has a "." in it will
> be considered a file.
>
> I am assuming that the directories don't use a "." but ...

Unfortunately, this is not reliable, since any site that happens to use a
dot in a directory name will break your spider script.
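To make the failure concrete, a heuristic like the one described (a hypothetical sketch, not code from this thread) misclassifies any dotted directory name:

```python
def looks_like_file(segment):
    # The heuristic from the thread: a dot in the segment means "file".
    return "." in segment

print(looks_like_file("file.html"))  # True -- correct
print(looks_like_file("v1.2"))       # True -- wrong if "v1.2" is a directory
```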

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"


