RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Tony Lewis
Coombe, Allan David (DPS) wrote:

> However, the case of the files on disk is still mixed - so I assume that
> wget is not using the URL it originally requested (harvested from the
> HTML?) to create directories and files on disk.  So what is it using? An
> HTTP header (if so, which one??).

I think wget uses the case from the HTML page(s) for the file name; your
proxy would need to change the URLs in the HTML pages to lower case too.

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Micah Cowan

Tony Lewis wrote:
> Coombe, Allan David (DPS) wrote:
>
>> However, the case of the files on disk is still mixed - so I assume
>> that wget is not using the URL it originally requested (harvested
>> from the HTML?) to create directories and files on disk.  So what
>> is it using? An HTTP header (if so, which one??).
>
> I think wget uses the case from the HTML page(s) for the file name;
> your proxy would need to change the URLs in the HTML pages to lower
> case too.

My understanding from David's post is that he claimed to have been doing
just that:

> I modified the response from the web site to lowercase the URLs in
> the HTML (actually I lowercased the whole response) and the data that
> wget put on disk was fully lowercased - problem solved - or so I
> thought.

My suspicion is it's not quite working, though, as otherwise
where would Wget be getting the mixed-case URLs?

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/



RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-24 Thread Coombe, Allan David (DPS)
Sorry Guys - just an ID 10 T error on my part.

I think I need to change 2 things in the proxy server.

1.  URLs in the HTML being returned to wget - this works OK
2.  The Content-Location header used when the web server reports a
301 Moved Permanently response - I think this works OK.

When I reported that it wasn't working I hadn't done both at the same
time.
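For reference, the whole-response rewrite can be sketched in a few lines of shell (the sample response below is made up; the real filter does this in Perl inside HTTP::Proxy):

```shell
# Lowercase an entire HTTP response - headers (including
# Content-Location) and HTML body alike - as the proxy filter does.
response='HTTP/1.1 301 Moved Permanently
Content-Location: /Senate/Committees/index.htm

<a href="/Senate/Committees/index.htm">inquiry</a>'
printf '%s\n' "$response" | tr '[:upper:]' '[:lower:]'
```

Lowercasing the whole response also folds the header names and status text, which is harmless since HTTP header names are case-insensitive anyway.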

Cheers

Allan



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-22 Thread Micah Cowan

Coombe, Allan David (DPS) wrote:
> OK - now I am confused.
>
> I found a Perl-based HTTP proxy (named HTTP::Proxy, funnily enough)
> that has filters to change both the request and response headers and
> data.  I modified the response from the web site to lowercase the URLs
> in the HTML (actually I lowercased the whole response) and the data that
> wget put on disk was fully lowercased - problem solved - or so I thought.
>
> However, the case of the files on disk is still mixed - so I assume that
> wget is not using the URL it originally requested (harvested from the
> HTML?) to create directories and files on disk.  So what is it using? An
> HTTP header (if so, which one??).

I think you're missing something on your end; I couldn't begin to tell
you what. Running with --debug will likely be informative.

Wget uses the URL that successfully results in a file download. If the
files on disk have mixed case, then it's because it was the result of a
mixed-case request from Wget (which, in turn, must have either resulted
from an explicit argument, or from HTML content).

The only exception to the above is when you explicitly enable
--content-disposition support, in which case Wget will use any filename
specified in a Content-Disposition header. Those are virtually never
issued, except for CGI-based downloads (and you have to explicitly
enable it).
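A quick way to see Micah's point - the local path is derived from the exact URL text wget requested - is to do the scheme/host stripping by hand (simplified; real wget also percent-decodes and applies --restrict-file-names):

```shell
# Two URLs differing only in case map to two distinct local paths,
# because the path is taken verbatim from the requested URL.
for url in 'http://host/Senate/committees/index.htm' \
           'http://host/senate/committees/index.htm'; do
  printf '%s\n' "${url#http://}"
done
```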

--
Good luck!
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-21 Thread Coombe, Allan David (DPS)
OK - now I am confused.

I found a Perl-based HTTP proxy (named HTTP::Proxy, funnily enough)
that has filters to change both the request and response headers and
data.  I modified the response from the web site to lowercase the URLs
in the HTML (actually I lowercased the whole response) and the data that
wget put on disk was fully lowercased - problem solved - or so I
thought.

However, the case of the files on disk is still mixed - so I assume that
wget is not using the URL it originally requested (harvested from the
HTML?) to create directories and files on disk.  So what is it using? An
HTTP header (if so, which one??).

Any ideas??

Cheers
Allan


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread Coombe, Allan David (DPS)
Thanks everyone for the contributions.

Ultimately, our purpose is to process documents from the site into our
search database, so probably the most important thing is to limit the
number of files being processed.  The case of the URLs in the HTML
probably wouldn't cause us much concern, but I could see that it might
be useful to convert a site for mirroring from a non-case-sensitive
(Windows) environment to a case-sensitive (li|u)nix one - this would
need to include translation of URLs in content as well as filenames on
disk.

In the meantime - does anyone know of a proxy server that could
translate urls from mixed case to lower case.  I thought that if we
downloaded using wget via such a proxy server we might get the
appropriate result.  

The other alternative we were thinking of was to post-process the files
with symlinks for all mixed-case versions of files and directories (I
think someone already suggested this - great minds and all that...). I
assume that wget would correctly use the symlink to determine the
time/date stamp of the file for determining if it requires updating (or
would it use the time/date stamp of the symlink?). I also assume that if
wget downloaded the file it would overwrite the symlink and we would
have to run our "convert files to symlinks" process again.
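That post-processing pass can be sketched in a few lines of shell (the sample tree below stands in for the real mirror root). Note that stat()-based timestamp checks follow symlinks, so wget should see the target file's date, not the link's; and if both casings already exist as real directories they must be merged first, since ln won't replace a directory:

```shell
# Create a lowercase symlink beside every mixed-case directory in the
# mirror, deepest-first so parent paths stay valid while we work.
root=mirror
mkdir -p "$root/Senate/Committees"   # hypothetical sample tree
find "$root" -depth -type d | while read -r d; do
  base=$(basename "$d")
  lower=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
  if [ "$base" != "$lower" ]; then
    ln -s "$base" "$(dirname "$d")/$lower"
  fi
done
find "$root" -type l
```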

Just to put it in perspective, the actual site is approximately 45gb
(that's what the administrator said) and wget downloaded 100gb
(463,000 files) when I did the first process.

Cheers
Allan



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread mm w
a simple URL-rewriting conf should fix the problem, without touching the
file system; everything can be done server side
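mm w doesn't say which server he has in mind, but assuming the origin runs Apache with mod_rewrite, the rewrite he describes is the classic tolower map (a sketch only; server-side, nothing on disk changes):

```apache
RewriteEngine On
RewriteMap lc int:tolower
# Redirect any request whose path contains an uppercase letter
# to its all-lowercase form.
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule (.*) ${lc:$1} [R=301,L]
```

Note that RewriteMap is only valid in server or virtual-host context, not in .htaccess.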

Best Regards

-- 
-mmw




RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread Tony Lewis
mm w wrote:

> a simple URL-rewriting conf should fix the problem, without touching
> the file system; everything can be done server side

Why do you assume the user of wget has any control over the server from which 
content is being downloaded?



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-19 Thread mm w
not all, but in this particular case I'm pretty sure they have


-- 
-mmw


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-16 Thread mm w
On Sat, Jun 14, 2008 at 4:30 PM, Tony Lewis [EMAIL PROTECTED] wrote:
> mm w wrote:
>
>> Hi, after all, after all it's only my point of view :D
>> anyway,
>>
>> /dir/file,
>> dir/File, non-standard
>> Dir/file, non-standard
>> and /Dir/File non-standard
>
> According to RFC 2396: "The path component contains data, specific to the
> authority (or the scheme if there is no authority component), identifying
> the resource within the scope of that scheme and authority."
>
> In other words, those names are well within the standard when the server
> understands them. As far as I know, there is nothing in Internet standards
> restricting mixed case paths.

:) read again, nobody does except some punk-head folks

>> that's it, if the server manages non-standard URLs, it's not my
>> concern; for me it doesn't exist
>
> Oh. I see. You're writing to say that wget should only implement features
> that are meaningful to you. Thanks for your narcissistic input.

No, I'm not such a jerk. A simple grep/sed on the website source to
remove the malicious URLs should be fine, or an HTTP redirection when a
malicious non-standard URL is requested.

On the other hand, if wget changed every link to lowercase, some people
would have the opposite problem. A golden rule: never distribute
mixed-case URLs (to your users) - a simple respect for them - and keep
everything in lower case.


> Tony





-- 
-mmw


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-14 Thread Tony Lewis
mm w wrote:

> Hi, after all, after all it's only my point of view :D
> anyway,
>
> /dir/file,
> dir/File, non-standard
> Dir/file, non-standard
> and /Dir/File non-standard

According to RFC 2396: "The path component contains data, specific to the
authority (or the scheme if there is no authority component), identifying
the resource within the scope of that scheme and authority."

In other words, those names are well within the standard when the server
understands them. As far as I know, there is nothing in Internet standards
restricting mixed case paths.

> that's it, if the server manages non-standard URL, it's not my
> concern, for me it doesn't exist

Oh. I see. You're writing to say that wget should only implement features that 
are meaningful to you. Thanks for your narcissistic input.

Tony



RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Tony Lewis
Micah Cowan wrote:

> Unfortunately, nothing really comes to mind. If you'd like, you could
> file a feature request at
> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
> asking Wget to treat URLs case-insensitively.

To have the effect that Allan seeks, I think the option would have to convert 
all URIs to lower case at an appropriate point in the process. I think you 
probably want to send the original case to the server (just in case it really 
does matter to the server). If you're going to treat different case URIs as 
matching then the lower-case version will have to be stored in the hash. The 
most important part (from the perspective that Allan voices) is that the 
versions written to disk use lower case characters.

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread mm w
standard: URLs are case-insensitive

you can adapt your software because some people don't respect the standard;
we are not in the 90's anymore - let people doing crappy things deal with
their crappy world

Cheers!

On Fri, Jun 13, 2008 at 2:08 PM, Tony Lewis [EMAIL PROTECTED] wrote:
> Micah Cowan wrote:
>
>> Unfortunately, nothing really comes to mind. If you'd like, you could
>> file a feature request at
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>> asking Wget to treat URLs case-insensitively.
>
> To have the effect that Allan seeks, I think the option would have to convert
> all URIs to lower case at an appropriate point in the process. I think you
> probably want to send the original case to the server (just in case it really
> does matter to the server). If you're going to treat different case URIs as
> matching then the lower-case version will have to be stored in the hash. The
> most important part (from the perspective that Allan voices) is that the
> versions written to disk use lower case characters.
>
> Tony





-- 
-mmw


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Micah Cowan

Tony Lewis wrote:
> Micah Cowan wrote:
>
>> Unfortunately, nothing really comes to mind. If you'd like, you
>> could file a feature request at
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an
>> option asking Wget to treat URLs case-insensitively.
>
> To have the effect that Allan seeks, I think the option would have to
> convert all URIs to lower case at an appropriate point in the
> process. I think you probably want to send the original case to the
> server (just in case it really does matter to the server). If you're
> going to treat different case URIs as matching then the lower-case
> version will have to be stored in the hash. The most important part
> (from the perspective that Allan voices) is that the versions written
> to disk use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
real cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook. Which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the external command hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).
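The rename hook Micah sketches doesn't exist in any released wget at this point, but what it would do to each harvested name can be shown directly (GNU sed's \L lowercases the rest of the replacement):

```shell
# What a sed-based rename hook would do to each filename before
# wget touches the disk: lowercase it (requires GNU sed for \L).
printf '%s\n' 'Senate/Committees/Index.htm' | sed 's/.*/\L&/'
```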

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Steven M. Schweda
   In the VMS world, where file name case may matter, but usually
doesn't, the normal scheme is to preserve case when creating files, but
to do case-insensitive comparisons on file names.

From Tony Lewis:

> To have the effect that Allan seeks, I think the option would have to
> convert all URIs to lower case at an appropriate point in the process.

   I think that that's the wrong way to look at it.  Implementation
details like name hashing may also need to be adjusted, but this
shouldn't be too hard.



   Steven M. Schweda   [EMAIL PROTECTED]
   382 South Warwick Street(+1) 651-699-9818
   Saint Paul  MN  55105-2547


RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Tony Lewis
mm w wrote:

> standard: URLs are case-insensitive
>
> you can adapt your software because some people don't respect the standard;
> we are not in the 90's anymore - let people doing crappy things deal with
> their crappy world

You obviously missed the point of the original posting: how can one
conveniently mirror a site whose server uses case insensitive names onto a
server that uses case sensitive names?

If the original site has the URI strings "/dir/file", "dir/File", "Dir/file",
and "/Dir/File", the same local file will be returned. However, wget will treat
those as unique directories and files and you wind up with four copies.

Allan asked if there is a way to have wget create just one copy and proposed
one way that might accomplish that goal.

Tony



RE: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread Tony Lewis
Steven M. Schweda wrote:

> From Tony Lewis:
>> To have the effect that Allan seeks, I think the option would have to
>> convert all URIs to lower case at an appropriate point in the process.
>
> I think that that's the wrong way to look at it.  Implementation
> details like name hashing may also need to be adjusted, but this
> shouldn't be too hard.

OK. How would you normalize the names?

Tony



Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-13 Thread mm w
Hi, after all, after all it's only my point of view :D
anyway,

/dir/file,
dir/File, non-standard
Dir/file, non-standard
and /Dir/File non-standard

that's it, if the server manages non-standard URL, it's not my
concern, for me it doesn't exist



-- 
-mmw


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-12 Thread Matthias Vill

Hi list!

Sadly I couldn't find Allan's e-mail address (maybe because I'm attached
via the news gateway), so this is a list-only post.


Micah Cowan wrote:

> Hi Allan,
>
> You'll generally get better results if you post to the mailing list
> (wget@sunsite.dk). I've added it to the recipients list.
>
> Coombe, Allan David (DPS) wrote:
>
>> Hi Micah,
>>
>> First some context…
>> We are using wget 1.11.3 to mirror a web site so we can do some offline
>> processing on it.  The mirror is on a Solaris 10 x86 server.
>>
>> The problem we are getting appears to be because the URLs in the HTML
>> pages that are harvested by wget for downloading have mixed case (the
>> site we are mirroring is running on a Windows 2000 server using IIS) and
>> the directory structure created on the mirror has 'duplicate'
>> directories because of the mixed case.
>>
>> For example, the URLs in HTML pages /Senate/committees/index.htm and
>> /senate/committees/index.htm refer to the same file but wget creates 2
>> different directory structures on the mirror site for these URLs.


OK... at this point I need to ask whether you're trying to mirror the
site or just back it up.
The main problem is easy: the moment you want a working mirror, you
either need those mixed-case files or you have to rewrite the URLs to a
unique casing.
It seems most practical to introduce a hook like --restrict-file-names
to modify the name of the local copy and the links inside the
downloaded files in the same way.
Another option is to create symlinks for the different directory cases.
That would save half the overhead, I guess.

To create such a symlink structure you could use the output of
find /mirror/basedir -type d | sort -f
Hope that helps.
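Sketching what that buys you (the sample paths are made up): a case-folded sort lists case-variant directories adjacently, so a follow-up pass can merge each run and plant the symlinks:

```shell
# sort -f folds case while comparing, grouping /mirror/Senate next to
# /mirror/senate; LC_ALL=C keeps the tie-break order deterministic.
printf '%s\n' /mirror/Senate /mirror/house /mirror/senate | LC_ALL=C sort -f
```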

Matthias


Re: Wget 1.11.3 - case sensitivity and URLs

2008-06-11 Thread Micah Cowan

Hi Allan,

You'll generally get better results if you post to the mailing list
(wget@sunsite.dk). I've added it to the recipients list.

Coombe, Allan David (DPS) wrote:
> Hi Micah,
>
> First some context…
> We are using wget 1.11.3 to mirror a web site so we can do some offline
> processing on it.  The mirror is on a Solaris 10 x86 server.
>
> The problem we are getting appears to be because the URLs in the HTML
> pages that are harvested by wget for downloading have mixed case (the
> site we are mirroring is running on a Windows 2000 server using IIS) and
> the directory structure created on the mirror has 'duplicate'
> directories because of the mixed case.
>
> For example, the URLs in HTML pages /Senate/committees/index.htm and
> /senate/committees/index.htm refer to the same file but wget creates 2
> different directory structures on the mirror site for these URLs.
>
> This appears to be a fairly basic thing, but we can't see any wget
> options that allow us to treat URLs case-insensitively.
>
> We don't really want to post-process the site just to merge the files
> and directories with different case.

Unfortunately, nothing really comes to mind. If you'd like, you could
file a feature request at
https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
asking Wget to treat URLs case-insensitively. Finding local files
case-insensitively, on a case-sensitive filesystem, would be a PITA; but
adding and looking up URLs in the internal blacklist hash wouldn't be
too hard. I probably wouldn't get to that for a while, though.

Another useful option might be to change the name of index files, so
that, for instance, you could have URLs like http://foo/ result in
foo/index.htm or foo/default.html, rather than foo/index.html.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/