Re: [Bug-wget] trouble with URL vs local file names
On Fri, Feb 19, 2010 at 10:01:18AM +0100, Tobias Senz wrote:
> On 19.02.2010 00:44, Andrew Cady wrote:
>> I have written a patch for wget which makes this behavior possible.
>> The very use case that you describe is illustrated in the email to
>> this list which contains the patch:
>> http://lists.gnu.org/archive/html/bug-wget/2010-01/msg00021.html
[...]
> Just to verify, and be 110% certain, your additional renaming happens
> AFTER all of the built-in escaping / renaming wget would do without
> the patch?

Well, I just tested, and yes: it turns out that --rename modifies the
output filename after the --restrict-file-names option already has, as
you would prefer. But this isn't vitally important, since it would be
possible to recreate the behavior of --restrict-file-names with
--rename alone.

> So any % from hex escapes could also be filtered?

Yep. Here's a test command line I just tried:

$ wget --restrict-file-names=windows --rename 's/%/@/g' 'jerkface.net/~d/t...@at|pipe%percent.txt'
[...]
“t...@at@7cp...@percent.txt” saved [0/0]

Here's another one that doesn't use --restrict-file-names:

$ wget --rename 's/[...@%\|\/:?*]/uc sprintf @%x, ord $/ge' 'jerkface.net/~d/t...@at|pipe%percent.txt'
[...]
“t...@40at@7cp...@25percent.txt” saved [0/0]

That version escapes the same characters as --restrict-file-names=windows,
but uses hex codes marked with @, so that you can reliably convert the
filenames back to the originals with another regular expression.

> Sorry, i'm a little bit puzzled this would fix all of my troubles,
> just like that :)

If wget output filenames are the extent of your troubles, your life is
quickly approaching perfection...

> Most recent Cygwin is still on wget-1.11.4 with patch level numero 4.
> Your patch applies directly to Debian 1.12 patch level 1.

Actually, the version in those debs is:

GNU Wget 1.12.1-devel (2340fa0d1b78) built on linux-gnu

That is, one behind the latest version in mercurial. I just checked,
and the patch applies to the latest version as well.
That is, it definitely applies to these versions:

changeset:   2647:14f751f028c2
tag:         tip
user:        Paul Townsend <a...@purdue.edu>
date:        Wed Jan 27 10:08:26 2010 -0800
summary:     Time-measurement fix

changeset:   2646:2340fa0d1b78
user:        Micah Cowan <mi...@cowan.name>
date:        Wed Jan 13 20:41:15 2010 -0800
summary:     Fixed some mixed declarations-and-code.

> So hopefully either TOS (the original source) 1.12 or the old Cygwin
> patched source will do. Otherwise i'll have to somehow shove it in
> manually ? :)) (It's been a while since my last attempt to compile
> anything anywhere, can you tell ? ;) )

I'm pretty sure it won't apply to 1.12, but that you only have to move
some words around on a split line in Makefile.am or something like
that. I'd bet the version from mercurial compiles in cygwin, anyway.
Good luck :)
[Bug-wget] trouble with URL vs local file names
Heja :)

When using recursive or page-requisite downloading, local folders are
created. Is there any way to switch that off without losing the URL?
For archival it would make much sense to me if local files were flat
but included as much of the URL as possible. But, all files in ONE
folder. Like so (made-up example):

wget -p -k www.google.de

creates locally the folders / files

./www.google.de
./www.google.de/index.html
./www.google.de/logos
./www.google.de/logos/olympics10-skeleton-hp.png

I'd prefer if that were ONLY files:

./www.google...@index.html
./www.google...@logos@olympics10-skeleton-hp.png

Or a URL like this

http://www.google.de/csi?v=3s=webhpaction=e=17259,17311,22713,23386,23756,23806ei=JwZ9S-StKYaC_Aatp-z3BAexpi=17259,17311,22713,23386,23756,23806imc=1imn=1imp=1rt=prt.41,xjsls.93,xjses.163,xjsee.206,xjs.229,ol.468,iml.241

locally as the file name

www.google...@csi@v...@3@s...@webhp@action@@e...@17259,17311,22713,23386,23756 [etc ...]

How would that be possible? On more complicated pages, or when getting
more than one page in one folder, everything is otherwise spread around
many (sub-)folders, which makes viewing later on more difficult than it
needs to be.

I'm aware of --no-directories, but that does not retain the information
(or an approximation thereof) of what file name something had on the
server, or even which server it comes from. (Having the protocol in the
name I could live without; the server/host part not so much.) When
using --no-directories with -N, things just get overwritten, and
without -N there would be just too many copies of the same URL across
several calls to wget. It also makes further processing impossible, as
the original URL is pretty much lost.

And I'm also having trouble with the way files are named locally, the
--restrict-file-names= thingie. Is there any way to also block % = (and
possibly others I can't think of right now - + maybe?) locally, as
these seem to prevent further processing in batch scripts?
As mentioned above, I'm more of a fan of @ for placeholders: rarely
(never?) used in HTTP, and it does not seem to cause any trouble when
scripting. I'm on Windows with Cygwin, and mixing of both batch
(cmd.exe) and shell (sh, tcsh ...) scripting as well as (Win)DOS and
Cygwin utilities might happen. In other words, those characters are
unsafe for me whenever a filename is passed to anything via the command
line. (I really haven't found any way to escape them in some
situations. Different types of quotes a-plenty, backslashes too;
nothing helps.)

http://wget.addictivecode.org/FeatureSpecifications mentions
ContentFilters; where are these - or a description - to be found?
Future? Is the "translate URIs to local filenames" mentioned there the
same mechanism I'm having trouble with?

Preferably, the flat path-/filenames thing could be built into wget as
--flat :) (Maybe I'm imagining it a lot simpler than it is, but if
there already is a central point where escaping for local file names
happens, could the slashes and backslashes just be removed before
creating folders?)

Thanks!
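Until something like a --flat option exists, the layout asked for above can be approximated after the fact. A sketch, under the assumption that post-processing the finished download tree is acceptable; the sample tree stands in for a real `wget -p -k` result, and the `flat` directory name is just an example:

```shell
# Sketch: flatten a wget download tree into a single directory, joining
# path components with '@' so the host and path survive in the filename.
# The sample tree below is made up, standing in for an actual download.
mkdir -p www.google.de/logos
touch www.google.de/index.html www.google.de/logos/olympics10-skeleton-hp.png

mkdir -p flat
find www.google.de -type f | while IFS= read -r path; do
    # www.google.de/logos/x.png -> www.google.de@logos@x.png
    flatname=$(printf '%s' "$path" | tr '/' '@')
    mv "$path" "flat/$flatname"
done
ls flat
```

Note this only preserves the path as saved on disk; anything wget already stripped or rewrote (the query string, the protocol) is not recoverable at this stage.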
Re: [Bug-wget] trouble with URL vs local file names
(answers inline)

Tobias Senz wrote:
> Heja :)
>
> When using recursive or page requisite downloading local folders are
> created. Is there any way to switch that off without loosing the URL?
> It would make much sense to me for archival if local files were flat
> but including as much of the URL as possible. But, all files in ONE
> folder. Like so, made up example:
>
> wget -p -k www.google.de
>
> creates locally the folders / files
>
> ./www.google.de
> ./www.google.de/index.html
> ./www.google.de/logos
> ./www.google.de/logos/olympics10-skeleton-hp.png
>
> I'd prefer if that were ONLY files
>
> ./www.google...@index.html
> ./www.google...@logos@olympics10-skeleton-hp.png
>
> [...]
>
> How would that be possible?

I'm afraid it isn't, currently.

> And i'm also having trouble with the way files are named locally, the
> --restrict-file-names= thingie. Is there any way to also block % =
> (and possibly others i can't think of right now - + maybe?) locally
> as these seem to prevent further processing in batch scripts? As
> mentioned above i'm more of a fan of @ for placeholders. Rarely
> (never?) used in http, and does not seem to make any trouble when
> scripting.

Wget doesn't currently support arbitrary name restrictions.

-- 
Micah J. Cowan
http://micah.cowan.name/
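Since stock wget cannot blacklist extra characters, one workaround is to rename the saved files afterwards. A sketch only: the character set (% = +) and the '@' replacement follow the preferences stated in the original mail, and the sample filename is made up:

```shell
# Workaround sketch: rename already-downloaded files, replacing the
# extra characters (% = +) that stock wget will not restrict with '@'.
touch 'csi_v=3+s=webhp%41.txt'   # made-up sample of an awkward saved name

for f in *; do
    safe=$(printf '%s' "$f" | tr '%=+' '@@@')
    [ "$f" = "$safe" ] || mv -- "$f" "$safe"
done
ls    # csi_v@3@s@webhp@41.txt
```

Unlike the @-hex scheme from the patch discussion, this plain one-to-one substitution is lossy: the original characters cannot be recovered from the renamed files.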