On Sat, Aug 03, 2013 at 11:50:48PM +0200, Ángel González wrote:
> On 03/08/13 21:07, Micah Cowan wrote:
> > On Sat, Aug 03, 2013 at 04:11:59PM +0200, Tim Rühsen wrote:
> > > As a second option, we could introduce (now or later)
> > >   --name-filter-program="program REGEX"
> > >
> > > The 'program' answers each line it gets (the original filename) with
> > > exactly one output line (the new filename) as long as Wget does not
> > > close the pipe. The 'program' needs to be started only once...
> >
> > Given the difficulty for novice users of ensuring that the program is
> > line-buffered (unless, again, we do something like allocate a pty), I
> > still feel that spawn-once will pose too much "surprise" (as in
> > "principle of least surprise") to non-expert shell folks to be the
> > default. And I still feel that it doesn't necessarily even pose any
> > realistic advantage, given that we're likely to wait on network reads
> > long enough for the transform to take place in the meantime.
>
> If stdbuf(1) were installed, wget could use it to disable the std
> buffering. Adding yet more variation between systems...
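For illustration, here is a minimal sketch of the spawn-once protocol
being proposed (the filenames and sed expressions are made up): one
original filename in per line, exactly one transformed name out per
line. The 'stdbuf -oL' prefix (GNU coreutils) is the line-buffering
step a novice user would otherwise have to know about:

```shell
#!/bin/sh
# Sketch of the proposed spawn-once filter protocol: the filter is
# started once, reads one original filename per line, and must reply
# with exactly one transformed filename per line.
# 'stdbuf -oL' (GNU coreutils) forces line-buffered stdout on sed, so
# each reply arrives as soon as the corresponding name is written,
# rather than sitting in a full-buffered pipe.
printf '%s\n' 'index.php' 'wget-1.10.2.tar.gz' |
    stdbuf -oL sed -e 's/\.php$/.html/' -e 's/\.tar\.gz$/.tgz/'
# -> index.html
# -> wget-1.10.2.tgz
```

Note that stdbuf works by preloading a library that adjusts stdio
buffering at startup, which is why it cannot help with programs (like
tee) that set their own buffers afterwards.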
Thanks for pointing that out; I'd completely forgotten about stdbuf...
IMO, though, stdbuf is a hack: very convenient when you need such a
thing, but ultimately pretty unreliable (undefined behavior for some
uses, unrelated to our needs, and it won't always have the effect we
want if the wrapped program explicitly adjusts its buffers, as tee
does).

As far as variation between systems is concerned, though, a possible
choice would be to disable the --name-filter-program option unless
stdbuf exists. Of course, it could always exist at configure time and
be absent at runtime... and it would probably limit the number of OSes
that could handle this feature unacceptably.

> I don't think wget should care about "not using too many pids".

Yeah, nor I. I didn't understand that complaint at all, though I
thought I'd throw it out there in case someone else had some idea why
it could be a valid reason.

> Although when continuing a recursive download where most files are
> already downloaded, it will need to rewrite a lot of filenames in
> rapid succession, so I wonder if it could trigger some forking rate
> limit (intended to prevent fork bombs, presumably).

I dunno. Can't a loop over sed in a shell script produce the same
problem? I haven't seen that before, myself. There is a "maximum number
of processes" limit on my GNU/Linux OS, which makes better sense to me,
since that prevents fork bombs without limiting typical shell usage.

But the recursive download situation (and possibly a "download from
localhost" situation) is among the exceptions that would likely make
such frequent spawns inefficient. Although, in such a case, if the
files meant to be transformed already exist, wouldn't they also already
be transformed? In which case they'd be redownloaded, in the absence of
some sort of database that can map original URLs to current files.
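For comparison, the spawn-per-name alternative is just the familiar
loop-over-sed pattern (filenames invented for illustration); any
fork-rate limit that bit wget here would bite this everyday idiom too:

```shell
#!/bin/sh
# Spawn-per-name model: a fresh, short-lived sed process for every
# filename. This is the ordinary shell pattern referenced above; it
# forks once per name, so a rapid succession of already-downloaded
# files means a rapid succession of forks.
for name in a.php b.php wget-1.10.2.tar.gz; do
    printf '%s\n' "$name" | sed 's/\.php$/.html/'
done
# -> a.html
# -> b.html
# -> wget-1.10.2.tar.gz
```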
> > ...I don't know anything about PCRE, but I'm hoping it has its own
> > parser for the common "s///" idiom, so Wget wouldn't have to
> > write/debug our own.
>
> I don't think we should allow letters as the separation character,
> which should fix the issue (inspired by php behavior on the preg_*
> functions: "Delimiter must not be alphanumeric or backslash").

Yes; although if PCRE has its own s/// parser, as I'd hope, this choice
may be unavailable to us. It'd simply be impossible for users to pick
"s" as a separator without also prefixing it with "s" (if they really
wanted, they could do ss...s...s). But that's silly; no one here's
going to spend time on support for a user that's doing that. :)

> > Oh yeah, while we're still on the subject, it might be worth
> > pointing out that Niwt also has a "unique name" protocol that works
> > as follows (Wget might find it handy, especially in combination
> > with a name transform). When Niwt can't save the file name it
> > wants, it feeds the "name-uniquer" program the intended file name
> > as an argument, and the uniquer is expected to print an infinite
> > series of incremented names; Niwt reads file names until it finds
> > the first one that it can create exclusively, and then closes the
> > pipe.

<snip>

> The wget-1.10.2.tar.gz example isn't the worst versioned-program
> transformation. If you had program-2.0.tgz, it would become
> program-2.1.0.tgz :(

Yeah, excellent point. That's even less acceptable.

...Trying to think of a way to still use this model, but avoid that
problem. We could stop at the first fully numeric component, but then
that doesn't work for program-2.0c.tgz. We could stop at any component
containing a number, but that doesn't work for "bz2". Or components
prefixed with numbers, but I imagine there are file extensions like
that too.
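A toy version of the uniquer protocol (not Niwt's actual program)
might look like this. It inserts the counter before the last dot,
which is exactly the kind of naive splitting criticized above: fine
for a name like program-2.0.tgz, but it lands the counter inside a
compound extension like .tar.gz:

```shell
#!/bin/sh
# Toy name-uniquer (illustration only): print an endless series of
# candidate names for $1, inserting a counter before the last dot.
# The consumer reads names until it finds one it can create
# exclusively, then closes the pipe; the resulting SIGPIPE (or printf
# write failure) terminates the uniquer.
uniquer() {
    base=${1%.*} ext=${1##*.} i=1
    while printf '%s.%d.%s\n' "$base" "$i" "$ext"; do
        i=$((i + 1))
    done
}
# Simulate a consumer that gives up after three candidates; note the
# counter ends up inside ".tar.gz":
uniquer wget-1.10.2.tar.gz | head -n 3
# -> wget-1.10.2.tar.1.gz
# -> wget-1.10.2.tar.2.gz
# -> wget-1.10.2.tar.3.gz
```

The protocol itself (infinite candidates, consumer closes the pipe)
is independent of the splitting heuristic; only the choice of where
to insert the counter is at issue in this thread.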
Didn't want to force the uniquer to have to recognize filetypes, since
that's a maintenance problem, though in practice it's probably only
necessary to recognize compression-format extensions, which reduces
the maintenance issue to some degree. But I also didn't want Niwt to
use Wget's idiom, as it can be impractical for downloading things and
then viewing them with a web browser or whatnot.

Obviously, the whole point of making the uniquer a separate program is
that users can work around such issues themselves; but I'd want to
avoid forcing them to do that wherever feasible.

-mjc
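If recognizing only compression-format extensions turned out to be
acceptable, the heuristic could be as small as a case statement; this
is a sketch under that assumption, and the extension list is
deliberately incomplete:

```shell
#!/bin/sh
# Hypothetical helper: treat known compound compression extensions
# (.tar.gz etc.) as a single unit, so the counter is inserted before
# the whole extension rather than inside it. Everything else falls
# back to splitting at the last dot.
first_candidate() {
    case $1 in
        *.tar.gz|*.tar.bz2|*.tar.xz) ext="tar.${1##*.}" ;;
        *)                           ext=${1##*.} ;;
    esac
    stem=${1%."$ext"}
    printf '%s.1.%s\n' "$stem" "$ext"
}
first_candidate wget-1.10.2.tar.gz   # -> wget-1.10.2.1.tar.gz
first_candidate program-2.0.tgz      # -> program-2.0.1.tgz
```

The case list is the maintenance burden mentioned above; keeping it
limited to compression formats is what keeps that burden small.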
