Re: Wget scriptability
Micah Cowan wrote:

> Okay, so there's been a lot of thought in the past regarding better
> extensibility features for Wget. Things like hooks for adding support
> for traversal of new Content-Types besides text/html, or adding some
> form of JavaScript support, or support for MetaLink. Also, support for
> being able to filter results pre- and post-processing by Wget: for
> example, being able to do some filtering on the HTML to change how Wget
> sees it before parsing for links, but without affecting the actual
> downloaded version; or filtering the links themselves to alter what
> Wget fetches.
>
> However, another thing that's been vaguely itching at me lately is the
> fact that Wget's design is not particularly unix-y. Instead of doing
> one thing, and doing it well, it does a lot of things, some well, some
> not.

It does what various people needed. It wasn't an exercise in writing a
unixy utility. It was a program that solved real problems for real
people.

> But the thing everyone loves about Unix and GNU (and certainly the
> thing that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
> paradigm,

I have always hated that. With a passion.

> - The tools themselves, as much as possible, should be written in an
>   easily-hackable scripting language. Python makes a good candidate.
>   Where we want efficiency, we can implement modules in C to do the
>   work.

At the time Wget was conceived, that was Tcl's mantra. It failed
miserably. :-)

How about concentrating on the problems listed in your first paragraph
(which is why I quoted it)? Could you show us how a bunch of shell tools
would solve them? Or how a library-ized Wget would solve them? Or how
any other paradigm or architecture or whatever would solve them?

--
Yes, I am an agent of Satan, but my duties are largely ceremonial.
[EMAIL PROTECTED]
Re: Wget scriptability
Dražen Kačar wrote:
> Micah Cowan wrote:
>> Okay, so there's been a lot of thought in the past regarding better
>> extensibility features for Wget. Things like hooks for adding support
>> for traversal of new Content-Types besides text/html, or adding some
>> form of JavaScript support, or support for MetaLink. Also, support for
>> being able to filter results pre- and post-processing by Wget: for
>> example, being able to do some filtering on the HTML to change how
>> Wget sees it before parsing for links, but without affecting the
>> actual downloaded version; or filtering the links themselves to alter
>> what Wget fetches.
>>
>> However, another thing that's been vaguely itching at me lately is the
>> fact that Wget's design is not particularly unix-y. Instead of doing
>> one thing, and doing it well, it does a lot of things, some well, some
>> not.
>
> It does what various people needed. It wasn't an exercise in writing a
> unixy utility. It was a program that solved real problems for real
> people.
>
>> But the thing everyone loves about Unix and GNU (and certainly the
>> thing that drew me to them), is the
>> bunch-of-tools-on-a-crazy-pipeline paradigm,
>
> I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command-line
utilities (that is, to pipeline). The power and flexibility of pipelines
is extremely well established in the Unix world; I feel no need
whatsoever to waste breath arguing for it, particularly when you haven't
given the reasons you hate it. For my part, I'm not exaggerating when I
say it's single-handedly responsible for why I'm a Unix/GNU user at all,
and for why I continue to enjoy developing on it so much.

  find . -name '*.html' -exec sed -i \
      's#http://oldhost/#http://newhost/#g' {} \;

  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
      gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  pic | tbl | eqn | eff-ing | troff -ms

Each of these demonstrates the enormously powerful technique of using
distinct tools, with distinct feature domains, together to form a
cohesive solution for the need at hand. The best part is that (with the
possible exception of the troff pipeline) each of these components is
immediately available for use in some other pipeline that does some
completely different job.

Note, though, that I don't intend that using Piped-Wget would actually
mean the user types in a special pipeline each time he wants to do
something with it. The primary driver would read some config file that
tells wget how it should do the piping. You'd just tweak the config file
when you want to add new functionality.
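To give a purely illustrative flavor of what one stage in such a
pipeline might look like, here is a minimal sketch of a link-extractor
written in Python (the name "extract-links" and the one-URL-per-line
convention are just assumptions for the example's sake, not a settled
design):

  #!/usr/bin/env python
  # extract-links: illustrative pipeline stage (name made up).
  # Reads an HTML document on stdin and writes one discovered URL per
  # line on stdout, so that any downstream filter (grep, sed, or a
  # user's own script) can rewrite or discard links before the driver
  # ever fetches them.
  import sys
  from html.parser import HTMLParser

  class LinkExtractor(HTMLParser):
      def handle_starttag(self, tag, attrs):
          # href covers <a> and <link>; src covers <img>, <script>, etc.
          for name, value in attrs:
              if name in ('href', 'src') and value:
                  print(value)

  LinkExtractor().feed(sys.stdin.read())

A user who wanted to, say, keep the fetcher away from a host they know
is flaky would then just drop a one-line "grep -v" stage into the config
between the extractor and the driver, with no C hacking and no waiting
for a release.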
>> - The tools themselves, as much as possible, should be written in an
>>   easily-hackable scripting language. Python makes a good candidate.
>>   Where we want efficiency, we can implement modules in C to do the
>>   work.
>
> At the time Wget was conceived, that was Tcl's mantra. It failed
> miserably. :-)

Are you claiming that Tcl's failure was due to its ability to integrate
with C, rather than to its abysmal inadequacy as a programming language
(which turned integration with C from an ability into an absolute
requirement for getting anything accomplished)?

> How about concentrating on the problems listed in your first paragraph
> (which is why I quoted it)? Could you show us how a bunch of shell
> tools would solve them? Or how a library-ized Wget would solve them? Or
> how any other paradigm or architecture or whatever would solve them?

It should be trivially obvious: you plug them in, rather than waiting
for the Wget developers to get around to implementing them. The thing
that both a library-ized Wget and a pipeline-ized Wget would offer is
the same: extreme flexibility. It puts the users in control of what Wget
does, rather than leaving them to perpetually hear "sorry, Wget can't do
that; you could hack the source, though". :p The difference between the
two is that a pipelined Wget offers this flexibility to a wider range of
users, whereas a library Wget offers it only to C programmers.

Or how would you expect to do these things without a library-ized (at
least) Wget? Implementing them in the core app (at least by default) is
clearly wrong (scope bloat). Giving Wget a plugin architecture is good,
but then there's only as much flexibility as there are hooks.
Library-izing Wget is equivalent to providing everything as hooks, and
puts the program using it in the driver's seat (and, naturally, there'd
be a wrapper implementation, like curl for libcurl). A suite of
interconnected utilities does the same, but is accessible to greater
numbers of people. This generally comes at some expense in efficiency
(don't all flexible architectures?); but Wget isn't CPU-bound, it's
network-bound.

As mentioned in my original post, this would be a separate project from
Wget. Wget would not be going away (though it seems likely to me that it
would quickly reach a primarily
Wget scriptability
Okay, so there's been a lot of thought in the past regarding better
extensibility features for Wget. Things like hooks for adding support
for traversal of new Content-Types besides text/html, or adding some
form of JavaScript support, or support for MetaLink. Also, support for
being able to filter results pre- and post-processing by Wget: for
example, being able to do some filtering on the HTML to change how Wget
sees it before parsing for links, but without affecting the actual
downloaded version; or filtering the links themselves to alter what Wget
fetches.

The original concept, before I came on board, was plugin modules. After
some thought, I decided I didn't like this overly much, and have mainly
been leaning toward the idea of a next-gen Wget-as-a-library thing,
probably wrapping libcurl (and with a command-line client version, like
curl). This obviously wouldn't have been a Wget any more, so it would
have been a separate project, with a different name.

However, another thing that's been vaguely itching at me lately is the
fact that Wget's design is not particularly unix-y. Instead of doing one
thing, and doing it well, it does a lot of things, some well, some not.

So for the last couple of days I've been thinking that maybe wget-ng
should be a suite of interoperating shell utilities, rather than a
library or a single app. This could have some really huge advantages:
users could choose their own HTML parser, they could plug in parsers for
whatever filetypes they desire, and people who want to implement exotic
features could do that...

Of course, at this point we're talking about something that's
fundamentally different from Wget, just as we were when we were
considering making a next-gen library version. It'd be a completely
separate project. And I'm still not going to start it right away (though
I think some preliminary requirements and design discussions would be a
good idea). Wget's not going to die, nor is everyone going to want to
switch to some new-fangled re-envisioning of it.

But the thing everyone loves about Unix and GNU (and certainly the thing
that drew me to them) is the bunch-of-tools-on-a-crazy-pipeline
paradigm, which is what enables you to mix and match different tools to
cover the different areas of functionality. Wget doesn't fit very well
into that scheme, and I think it could become even more powerful than it
already is by being broken into smaller, more discrete projects. Or, to
be more precise, by offering an alternative that does the equivalent.

So far, the following principles have struck me as advisable for a
project such as this:

- The tools themselves, as much as possible, should be written in an
  easily-hackable scripting language. Python makes a good candidate.
  Where we want efficiency, we can implement modules in C to do the
  work.

- While efficiency won't be the highest priority (else we'd just stick
  with the monolith), it's still important. Spawning off separate
  processes to each fetch their own page, initiating a new connection
  each time, would be a lousy idea. So the architectural model should
  center around a URL-getter driver that manages connections and such,
  reusing persistent ones as much as possible (see the sketch below).
  Of course, there might be distinct commands to handle separate types
  of URLs (or alternative methods for handling them, such as MetaLink),
  and perhaps not all of these would be able to do persistence (a
  dead-simple way to add support for scp, etc., might be to simply call
  the command-line program).
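As a rough sketch only (the name "url-getter" is made up, and HTTPS,
redirects, error recovery, and reconnecting when a server drops the
connection are all omitted), such a driver might read URLs on stdin and
keep one persistent HTTP/1.1 connection per host, using nothing but
Python's standard library:

  #!/usr/bin/env python
  # url-getter: illustrative driver sketch (name made up).
  # Reads URLs, one per line, on stdin; keeps a single persistent
  # HTTP/1.1 connection per host, so that many pages from one server
  # share one TCP connection instead of reconnecting for every page.
  import sys
  import http.client
  from urllib.parse import urlsplit

  connections = {}  # host[:port] -> open HTTPConnection, reused

  def fetch(url):
      parts = urlsplit(url)  # assumes plain http:// URLs
      conn = connections.get(parts.netloc)
      if conn is None:
          conn = http.client.HTTPConnection(parts.netloc)
          connections[parts.netloc] = conn
      path = parts.path or '/'
      if parts.query:
          path += '?' + parts.query
      conn.request('GET', path)
      # Read the whole response; the connection can't carry the next
      # request until the previous response has been consumed.
      return conn.getresponse().read()

  for line in sys.stdin:
      url = line.strip()
      if url:
          sys.stdout.write('%s: %d bytes\n' % (url, len(fetch(url))))

Non-HTTP schemes would then be a matter of dispatching those URLs to a
different getter (the call-the-scp-program case above), not of patching
a monolith.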
--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/