on 11/30/02 11:02, [EMAIL PROTECTED] purportedly said:

> Yes, I found that there are a lot of bad links. I think there are more
> bad links than good ones.
> 
> Yes, you're right, that URL was not a real one. The PATH_INFO doesn't
> always contain an http:// part; sometimes it is just a few directories
> with a plain local file name at the end.
> 
> I can save it under any name, but I want to know the real file name
> because I will follow the links from that page, and that path will be
> their base URL. If the links from that page are relative, like
> "dir/file.html", and I can't determine the exact directory depth, I
> won't be able to browse the pages locally without errors.

There should not be any errors unless the site is broken, in which case you
can't follow the link anyway. Remember that *no* web client, including every
web browser available today (at least none that I have ever worked with),
understands PATH_INFO. If you are getting errors, it is likely because of
your code, and not the site. Take the following example:

    http://www.site.com/script.cgi/dir1/file.html

Say there is a link on the returned page to "file2.html". Its URL will be:

    http://www.site.com/script.cgi/dir1/file2.html

That is how it should be, and the site should respond without error. If it
does respond with an error, any web browser would get the same error from
that link, assuming your code is correct.

If the returned page has the link "/images/image.gif", its URL will be:

    http://www.site.com/images/image.gif
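
If you want to see the resolution rules in code, here is a minimal sketch
using Python's urllib.parse.urljoin (the URLs are the made-up examples from
above):

    from urllib.parse import urljoin

    base = "http://www.site.com/script.cgi/dir1/file.html"

    # A relative link resolves against the directory of the base URL,
    # PATH_INFO included:
    print(urljoin(base, "file2.html"))
    # -> http://www.site.com/script.cgi/dir1/file2.html

    # A root-relative link replaces the whole path:
    print(urljoin(base, "/images/image.gif"))
    # -> http://www.site.com/images/image.gif

urljoin follows the same RFC resolution rules browsers use, so a spider
built on it will construct the same URLs a browser would.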

You can test this yourself by creating your own script and seeing how your
browser behaves.
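
For example, a bare-bones CGI script (sketched here in Python; the file
name and server setup are assumptions) that simply echoes PATH_INFO back:

    #!/usr/bin/env python3
    # Hypothetical test script: save as echo.py in a CGI-enabled
    # directory and request, e.g.,
    # http://yourhost/cgi-bin/echo.py/dir1/file.html
    import os

    print("Content-Type: text/plain")
    print()
    print("PATH_INFO is:", os.environ.get("PATH_INFO", "(empty)"))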

> I am parsing the URL, and the first path segment that has a "." in it
> will be considered a file.
> 
> I am assuming that the directory names don't use a "." but ...

Unfortunately, this is not reliable since any site which happens to use a
dot in a directory name will break your spider script.
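
To see that failure mode concretely, here is a quick sketch of the
heuristic (the function name and the "v1.0" site are hypothetical):

    from urllib.parse import urlsplit

    def first_dotted_segment(url):
        # Return the first path segment containing a ".", which the
        # spider would treat as the script/file name.
        for segment in urlsplit(url).path.split("/"):
            if "." in segment:
                return segment
        return None

    # Works on the example above:
    first_dotted_segment("http://www.site.com/script.cgi/dir1/file.html")
    # -> 'script.cgi'

    # Breaks as soon as a directory name contains a dot:
    first_dotted_segment("http://www.site.com/v1.0/docs/index.html")
    # -> 'v1.0', a directory mistaken for the file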

Keary Suska
Esoteritech, Inc.
"Leveraging Open Source for a better Internet"
