Re: [Bug-wget] trouble with URL vs local file names

2010-02-19 Thread Andrew Cady
On Fri, Feb 19, 2010 at 10:01:18AM +0100, Tobias Senz wrote:
 On 19.02.2010 00:44, Andrew Cady wrote:
  I have written a patch for wget which makes this behavior possible.  The
  very use case that you describe is illustrated in the email to this list
  which contains the patch:
  
http://lists.gnu.org/archive/html/bug-wget/2010-01/msg00021.html

[...]

 Just to verify, and be 110% certain, your additional renaming happens
 AFTER all of the built-in escaping / renaming wget would do without the
 patch?

Well, I just tested, and yes: it turns out that --rename modifies the
output filename after the --restrict-file-names option already has, as
you would prefer.  But this isn't actually vital, since it would be
possible to recreate the behavior of --restrict-file-names with
--rename alone.
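
For instance (an untested sketch off the top of my head, and the exact
character set that the unix restriction escapes may differ), something
like this should give roughly the effect of --restrict-file-names=unix,
percent-escaping control characters, plus % itself so that the encoding
stays reversible:

  $ wget --rename 's/([\x00-\x1f\x7f%])/sprintf "%%%02x", ord $1/ge' URL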

 So any % from hex escapes could also be filtered?

Yep.  Here's a test command line I just tried:

  $ wget --restrict-file-names=windows --rename 's/%/@/g' 'jerkface.net/~d/t...@at|pipe%percent.txt'
  [...] “t...@at@7cp...@percent.txt” saved [0/0]

Here's another one that doesn't use --restrict-file-names:

  $ wget --rename 's/[@%\|\/:?*]/uc sprintf "@%x", ord $&/ge' 'jerkface.net/~d/t...@at|pipe%percent.txt'
  [...] “t...@40at@7cp...@25percent.txt” saved [0/0]

That version escapes the same characters as --restrict-file-names=windows,
but uses hex codes marked with @, so that you can reliably convert the
filenames back to the originals using another regular expression.
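
For instance, a one-liner along these lines (untested sketch) should
undo the mapping:

  $ perl -e 'for (@ARGV) { (my $orig = $_) =~ s/\@([0-9a-f]{2})/chr hex $1/gie; rename $_, $orig unless $orig eq $_ }' *

And since a literal @ is itself escaped (to @40), no unescaped @ can
survive in the output names, so the decoding is unambiguous.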

 Sorry, I'm a little bit puzzled that this would fix all of my
 troubles, just like that :)

If wget output filenames are the extent of your troubles, your life is
quickly approaching perfection...

 Most recent Cygwin is still on wget-1.11.4 with patch level numero 4.
 Your patch applies directly to Debian 1.12 patch level 1. 

Actually, the version in those debs is:

  GNU Wget 1.12.1-devel (2340fa0d1b78) built on linux-gnu.

That is, one revision behind the latest version in Mercurial.  I just
checked, and the patch applies to the latest version as well; it
definitely applies to both of these versions:

  changeset:   2647:14f751f028c2
  tag:         tip
  user:        Paul Townsend <a...@purdue.edu>
  date:        Wed Jan 27 10:08:26 2010 -0800
  summary:     Time-measurement fix

  changeset:   2646:2340fa0d1b78
  user:        Micah Cowan <mi...@cowan.name>
  date:        Wed Jan 13 20:41:15 2010 -0800
  summary:     Fixed some mixed declarations-and-code.

 So hopefully either TOS (the original source) 1.12 or the old Cygwin
 patched source will do. 

 Otherwise I'll have to somehow shove it in manually? :)) (It's been a
 while since my last attempt to compile anything anywhere, can you
 tell? ;) )

I'm pretty sure it won't apply cleanly to 1.12, but you'd probably only
have to move some words around on a split line in Makefile.am, or
something like that.  I'd bet the version from Mercurial compiles under
Cygwin, anyway.

Good luck :)




[Bug-wget] trouble with URL vs local file names

2010-02-18 Thread Tobias Senz
Heja :)

When using recursive or page-requisite downloading, local folders are
created. Is there any way to switch that off without losing the URL?
For archival it would make much sense to me if local files were flat
but kept as much of the URL as possible.
That is, all files in ONE folder.

Like so (made-up example):

wget -p -k www.google.de

creates locally the folders / files
./www.google.de
./www.google.de/index.html
./www.google.de/logos
./www.google.de/logos/olympics10-skeleton-hp.png

I'd prefer it if those were ONLY files
./www.google.de@index.html
./www.google.de@logos@olympics10-skeleton-hp.png

Or a URL like this
http://www.google.de/csi?v=3&s=webhp&action=&e=17259,17311,22713,23386,23756,23806&ei=JwZ9S-StKYaC_Aatp-z3BA&expi=17259,17311,22713,23386,23756,23806&imc=1&imn=1&imp=1&rt=prt.41,xjsls.93,xjses.163,xjsee.206,xjs.229,ol.468,iml.241

locally as the file name
www.google.de@csi@v@3@s@webhp@action@@e@17259,17311,22713,23386,23756
[etc ...]

How would that be possible?
On more complicated pages, or when getting more than one page into one
folder, everything is otherwise spread across many (sub-)folders,
which makes viewing later on more difficult than it needs to be.

I'm aware of --no-directories, but that does not retain the info (or
an approximation thereof) of what file name something had on the
server, or even which server it came from. (The protocol in the name I
could live without; the server name, when spanning hosts, not so much.)
When using --no-directories with -N, things just get overwritten, and
without -N there would be far too many copies of the same URL across
several calls to wget. It also makes further processing impossible, as
the original URL is pretty much lost.


And I'm also having trouble with the way files are named locally, the
--restrict-file-names= thingie.
Is there any way to also block % & = (and possibly others I can't
think of right now; + maybe?) locally, as these seem to prevent
further processing in batch scripts? As mentioned above, I'm more of a
fan of @ for placeholders: it is rarely (never?) used in HTTP URLs, and
does not seem to cause any trouble when scripting.

I'm on Windows with Cygwin, and a mix of batch (cmd.exe) and shell
(sh, tcsh, ...) scripting, as well as of (Win)DOS and Cygwin utilities,
might happen. In other words, these characters are unsafe for me
whenever a filename is passed to anything via the command line. (I
really haven't found any way to escape them in some situations;
different types of quotes a-plenty, backslashes too, nothing helps.)

http://wget.addictivecode.org/FeatureSpecifications
mentions ContentFilters; where are these, or a description of them, to
be found? Future work?
Is the "translate URIs to local filenames" mentioned there the same
thing I'm having trouble with?
Preferably the flat path/filename thing could be built into wget as
--flat :)
(Maybe I'm imagining it a lot simpler than it is, but if there already
is a central point where escaping for local file names happens, could
the slashes and backslashes just be removed before creating folders?)
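
(In the meantime I imagine something like this rough, untested sketch
could flatten an already-downloaded tree after the fact, though it can
no longer tell an original / apart from a literal @:

  $ find www.google.de -type f | while read -r f; do
      mv "$f" "$(printf '%s' "$f" | tr '/' '@')"
    done

But having it built into wget would of course be nicer.)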

Thanks!




Re: [Bug-wget] trouble with URL vs local file names

2010-02-18 Thread Micah Cowan
(answers inline)

Tobias Senz wrote:
 Heja :)
 
 When using recursive or page-requisite downloading, local folders are
 created. Is there any way to switch that off without losing the URL?
 For archival it would make much sense to me if local files were flat
 but kept as much of the URL as possible.
 That is, all files in ONE folder.
 
 Like so (made-up example):
 
 wget -p -k www.google.de
 
 creates locally the folders / files
 ./www.google.de
 ./www.google.de/index.html
 ./www.google.de/logos
 ./www.google.de/logos/olympics10-skeleton-hp.png
 
 I'd prefer it if those were ONLY files
 ./www.google.de@index.html
 ./www.google.de@logos@olympics10-skeleton-hp.png
 
 Or a URL like this
 http://www.google.de/csi?v=3&s=webhp&action=&e=17259,17311,22713,23386,23756,23806&ei=JwZ9S-StKYaC_Aatp-z3BA&expi=17259,17311,22713,23386,23756,23806&imc=1&imn=1&imp=1&rt=prt.41,xjsls.93,xjses.163,xjsee.206,xjs.229,ol.468,iml.241
 
 locally as the file name
 www.google.de@csi@v@3@s@webhp@action@@e@17259,17311,22713,23386,23756
 [etc ...]
 
 How would that be possible?

I'm afraid it isn't, currently.

 And I'm also having trouble with the way files are named locally, the
 --restrict-file-names= thingie.
 Is there any way to also block % & = (and possibly others I can't
 think of right now; + maybe?) locally, as these seem to prevent
 further processing in batch scripts? As mentioned above, I'm more of a
 fan of @ for placeholders: it is rarely (never?) used in HTTP URLs, and
 does not seem to cause any trouble when scripting.

Wget doesn't currently support arbitrary name restrictions.

-- 
Micah J. Cowan
http://micah.cowan.name/