-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Allan,

You'll generally get better results if you post to the mailing list
(wget@sunsite.dk). I've added it to the recipients list.

Coombe, Allan David (DPS) wrote:
> Hi Micah,
> 
> First some context…
> We are using wget 1.11.3 to mirror a web site so we can do some offline
> processing on it.  The mirror is on a Solaris 10 x86 server.
> 
> The problem we are getting appears to be because the URLs in the HTML
> pages that are harvested by wget for downloading have mixed case (the
> site we are mirroring is running on a Windows 2000 server using IIS) and
> the directory structure created on the mirror has 'duplicate'
> directories because of the mixed case.
> 
> For example,  the URLs in HTML pages /Senate/committees/index.htm and
> /senate/committees/index.htm refer to the same file but wget creates 2
> different directory structures on the mirror site for these URLs.
> 
> This appears to be a fairly basic thing, but we can't see any wget
> options that allow us to treat URLs case insensitively.
> 
> We don't really want to post-process the site just to merge the files
> and directories with different case.

Unfortunately, nothing really comes to mind. If you'd like, you could
file a feature request at
https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
asking Wget to treat URLs case-insensitively. Finding local files
case-insensitively, on a case-sensitive filesystem, would be a PITA; but
adding and looking up URLs in the internal blacklist hash wouldn't be
too hard. I probably wouldn't get to that for a while, though.
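
To illustrate the blacklist idea, here is a rough sketch in Python
(not Wget's actual C code). The names url_key and SeenUrls are made up
for this example, and lowercasing the path is an assumption that only
holds for servers that really are case-insensitive, such as IIS.

```python
from urllib.parse import urlsplit, urlunsplit


def url_key(url):
    """Return a case-folded lookup key for a URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Scheme and host are case-insensitive by definition. Lowercasing the
    # path is the assumed opt-in behaviour; the query is left untouched
    # and the fragment is dropped because it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path.lower(), parts.query, ""))


class SeenUrls:
    """Set of already-queued URLs, compared case-insensitively."""

    def __init__(self):
        self._first_seen = {}  # folded key -> spelling first encountered

    def add(self, url):
        """Record url and return the spelling first seen for it."""
        return self._first_seen.setdefault(url_key(url), url)


seen = SeenUrls()
first = seen.add("http://example.com/Senate/committees/index.htm")
second = seen.add("http://example.com/senate/committees/index.htm")
```

Because the second call returns the spelling recorded by the first,
both links would map to a single URL and therefore to a single
directory tree on the mirror, with no case-insensitive file lookups
needed.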

Another useful option might be to change the name of "index" files, so
that, for instance, you could have URLs like http://foo/ result in
"foo/index.htm" or "foo/default.html", rather than "foo/index.html".
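
As a sketch of what such an option would control (Python again, for
illustration only; local_path and its index_name parameter are
hypothetical and do not correspond to any existing Wget option):

```python
from urllib.parse import urlsplit


def local_path(url, index_name="index.html"):
    """Map a URL to a local file path, naming directory URLs index_name."""
    parts = urlsplit(url)
    path = parts.path or "/"  # "http://foo" is treated like "http://foo/"
    if path.endswith("/"):
        # A directory-style URL has no file name, so append the chosen one.
        path += index_name
    return parts.netloc + path


default_name = local_path("http://foo/")
custom_name = local_path("http://foo/", index_name="default.html")
```

With the default, http://foo/ maps to "foo/index.html"; passing
"default.html" gives "foo/default.html" instead.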

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUG937M8hyUobTrERAqq2AJ48mGvcFCSxnouTFqYTuRHzVgwYdgCeLegI
vkdzf3Lu+Vn5diCOHk5CRhc=
=IlG9
-----END PGP SIGNATURE-----
