On Sat, Aug 03, 2013 at 11:50:48PM +0200, Ángel González wrote:
> On 03/08/13 21:07, Micah Cowan wrote:
> > On Sat, Aug 03, 2013 at 04:11:59PM +0200, Tim Rühsen wrote:
> > > As a second option, we could introduce (now or later)
> > >   --name-filter-program="program REGEX"
> > >
> > > The 'program' answers each line it gets (the original filename) with
> > > exactly one output line (the new filename) as long as Wget does not
> > > close the pipe. The 'program' needs to be started only once...
> >
> > Given the difficulty for novice users of ensuring that the program is
> > line-buffered (unless, again, we do something like allocate a pty), I
> > still feel that spawn-once will pose too much "surprise" (as in
> > "principle of least surprise") to non-expert shell folks to be the
> > default. And I still feel that it doesn't necessarily even pose any
> > realistic advantage, given that we're likely to wait on network reads
> > long enough for the transform to take place in the meantime.
>
> If stdbuf(1) were installed, wget could use it to disable the std
> buffering. Adding yet more variation between systems...
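For illustration, here is a minimal sketch of the spawn-once protocol
being proposed (the filenames and sed expressions are made up): one
original filename in per line, exactly one transformed name out per
line. The 'stdbuf -oL' prefix (GNU coreutils) is the line-buffering
step a novice user would otherwise have to know about:

```shell
#!/bin/sh
# Sketch of the proposed spawn-once filter protocol: the filter is
# started once, reads one original filename per line, and must reply
# with exactly one transformed filename per line.
# 'stdbuf -oL' (GNU coreutils) forces line-buffered stdout on sed, so
# each reply arrives as soon as the corresponding name is written,
# rather than sitting in a full-buffered pipe.
printf '%s\n' 'index.php' 'wget-1.10.2.tar.gz' |
    stdbuf -oL sed -e 's/\.php$/.html/' -e 's/\.tar\.gz$/.tgz/'
# -> index.html
# -> wget-1.10.2.tgz
```

Note that stdbuf works by preloading a library that adjusts stdio
buffering at startup, which is why it cannot help with programs (like
tee) that set their own buffers afterwards.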
Thanks for pointing that out; I'd completely forgotten about stdbuf...
IMO, though, stdbuf is a hack: very convenient when you need such a
thing, but ultimately pretty unreliable (undefined behavior for some
uses, unrelated to our needs, and it won't always have the effect we
want if the wrapped program explicitly adjusts its buffers, as tee
does).

As far as variation between systems is concerned, though, a possible
choice would be to disable the --name-filter-program option unless
stdbuf exists. Of course, it could always exist at configure time and
be absent at runtime... and it would probably limit the number of OSes
that could handle this feature unacceptably.

> I don't think wget should care about "not using too many pids".

Yeah, nor I. I didn't understand that complaint at all, though I
thought I'd throw it out there in case someone else had some idea why
it could be a valid reason.

> Although when continuing a recursive download where most files are
> already downloaded, it will need to rewrite a lot of filenames in
> rapid succession, so I wonder if it could trigger some forking rate
> limit (intended to prevent fork bombs, presumably).

I dunno. Can't a loop over sed in a shell script produce the same
problem? I haven't seen that before, myself. There is a "maximum number
of processes" limit on my GNU/Linux OS, which makes better sense to me,
since that prevents fork bombs without limiting typical shell usage.

But the recursive download situation (and possibly a "download from
localhost" situation) is among the exceptions that would likely make
such frequent spawns inefficient. Although, in such a case, if the
files meant to be transformed already exist, wouldn't they also already
be transformed? In which case they'd be redownloaded, in the absence of
some sort of database that can map original URLs to current files.
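For comparison, the spawn-per-name alternative is just the familiar
loop-over-sed pattern (filenames invented for illustration); any
fork-rate limit that bit wget here would bite this everyday idiom too:

```shell
#!/bin/sh
# Spawn-per-name model: a fresh, short-lived sed process for every
# filename. This is the ordinary shell pattern referenced above; it
# forks once per name, so a rapid succession of already-downloaded
# files means a rapid succession of forks.
for name in a.php b.php wget-1.10.2.tar.gz; do
    printf '%s\n' "$name" | sed 's/\.php$/.html/'
done
# -> a.html
# -> b.html
# -> wget-1.10.2.tar.gz
```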
> > ...I don't know anything about PCRE, but I'm hoping it has its own
> > parser for the common "s///" idiom, so Wget wouldn't have to
> > write/debug our own.
>
> I don't think we should allow letters as the separation character,
> which should fix the issue (inspired by php behavior on the preg_*
> functions: "Delimiter must not be alphanumeric or backslash").

Yes; although if PCRE has its own s/// parser, as I'd hope, this choice
may be unavailable to us. It'd simply be impossible for users to pick
"s" as a separator without also prefixing it with "s" (if they really
wanted, they could do ss...s...s). But that's silly; no one here's
going to spend time on support for a user that's doing that. :)

> > Oh yeah, while we're still on the subject, it might be worth
> > pointing out that Niwt also has a "unique name" protocol that works
> > as follows (Wget might find it handy, especially in combination
> > with a name transform). When Niwt can't save the file name it
> > wants, it feeds the "name-uniquer" program the intended file name
> > as an argument, and the uniquer is expected to print an infinite
> > series of incremented names; Niwt reads file names until it finds
> > the first one that it can create exclusively, and then closes the
> > pipe.

<snip>

> The wget-1.10.2.tar.gz example isn't the worst versioned-program
> transformation. If you had program-2.0.tgz, it would become
> program-2.1.0.tgz :(

Yeah, excellent point. That's even less acceptable.

...Trying to think of a way to still use this model, but avoid that
problem. We could stop at the first fully numeric component, but then
that doesn't work for program-2.0c.tgz. We could stop at any component
containing a number, but that doesn't work for "bz2". Or components
prefixed with numbers, but I imagine there are file extensions like
that too.
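A toy version of the uniquer protocol (not Niwt's actual program)
might look like this. It inserts the counter before the last dot,
which is exactly the kind of naive splitting criticized above: fine
for a name like program-2.0.tgz, but it lands the counter inside a
compound extension like .tar.gz:

```shell
#!/bin/sh
# Toy name-uniquer (illustration only): print an endless series of
# candidate names for $1, inserting a counter before the last dot.
# The consumer reads names until it finds one it can create
# exclusively, then closes the pipe; the resulting SIGPIPE (or printf
# write failure) terminates the uniquer.
uniquer() {
    base=${1%.*} ext=${1##*.} i=1
    while printf '%s.%d.%s\n' "$base" "$i" "$ext"; do
        i=$((i + 1))
    done
}
# Simulate a consumer that gives up after three candidates; note the
# counter ends up inside ".tar.gz":
uniquer wget-1.10.2.tar.gz | head -n 3
# -> wget-1.10.2.tar.1.gz
# -> wget-1.10.2.tar.2.gz
# -> wget-1.10.2.tar.3.gz
```

The protocol itself (infinite candidates, consumer closes the pipe)
is independent of the splitting heuristic; only the choice of where
to insert the counter is at issue in this thread.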
Didn't want to force the uniquer to have to recognize filetypes, since
that's a maintenance problem, though in practice it's probably only
necessary to recognize compression-format extensions, which reduces
the maintenance issue to some degree. But I also didn't want Niwt to
use Wget's idiom, as it can be impractical for downloading things and
then viewing them with a web browser or whatnot.

Obviously, the whole point of making the uniquer a separate program is
that users can work around such issues themselves; but I'd want to
avoid forcing them to do that wherever feasible.

-mjc
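If recognizing only compression-format extensions turned out to be
acceptable, the heuristic could be as small as a case statement; this
is a sketch under that assumption, and the extension list is
deliberately incomplete:

```shell
#!/bin/sh
# Hypothetical helper: treat known compound compression extensions
# (.tar.gz etc.) as a single unit, so the counter is inserted before
# the whole extension rather than inside it. Everything else falls
# back to splitting at the last dot.
first_candidate() {
    case $1 in
        *.tar.gz|*.tar.bz2|*.tar.xz) ext="tar.${1##*.}" ;;
        *)                           ext=${1##*.} ;;
    esac
    stem=${1%."$ext"}
    printf '%s.1.%s\n' "$stem" "$ext"
}
first_candidate wget-1.10.2.tar.gz   # -> wget-1.10.2.1.tar.gz
first_candidate program-2.0.tgz      # -> program-2.0.1.tgz
```

The case list is the maintenance burden mentioned above; keeping it
limited to compression formats is what keeps that burden small.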
