Re: Wget scriptability

2008-08-03 Thread Dražen Kačar
Micah Cowan wrote:

 Okay, so there's been a lot of thought in the past, regarding better
 extensibility features for Wget. Things like hooks for adding support
 for traversal of new Content-Types besides text/html, or adding some
 form of JavaScript support, or support for MetaLink. Also, support for
 being able to filter results pre- and post-processing by Wget: for
 example, being able to do some filtering on the HTML to change how Wget
 sees it before parsing for links, but without affecting the actual
 downloaded version; or filtering the links themselves to alter what Wget
 fetches.

 However, another thing that's been vaguely itching at me lately is the
 fact that Wget's design is not particularly unix-y. Instead of doing one
 thing, and doing it well, it does a lot of things, some well, some not.

It does what various people needed. It wasn't an exercise in writing a
unixy utility. It was a program that solved real problems for real
people.

 But the thing everyone loves about Unix and GNU (and certainly the thing
 that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
 paradigm,

I have always hated that. With a passion.

  - The tools themselves, as much as possible, should be written in an
 easily-hackable scripting language. Python makes a good candidate. Where
 we want efficiency, we can implement modules in C to do the work.

At the time Wget was conceived, that was Tcl's mantra. It failed
miserably. :-)

How about concentrating on the problems listed in your first paragraph
(which is why I quoted it)? Could you show us how a bunch of shell
tools would solve them? Or how a library-ized Wget would solve them? Or
how any other paradigm or architecture or whatever would solve them?

-- 
 .-.   .-.    Yes, I am an agent of Satan, but my duties are largely
(_  \ /  _)   ceremonial.
 |
 |            [EMAIL PROTECTED]


Re: Wget scriptability

2008-08-03 Thread Micah Cowan

Dražen Kačar wrote:
 Micah Cowan wrote:
 
 Okay, so there's been a lot of thought in the past, regarding better
 extensibility features for Wget. Things like hooks for adding support
 for traversal of new Content-Types besides text/html, or adding some
 form of JavaScript support, or support for MetaLink. Also, support for
 being able to filter results pre- and post-processing by Wget: for
 example, being able to do some filtering on the HTML to change how Wget
 sees it before parsing for links, but without affecting the actual
 downloaded version; or filtering the links themselves to alter what Wget
 fetches.
 
 However, another thing that's been vaguely itching at me lately is the
 fact that Wget's design is not particularly unix-y. Instead of doing one
 thing, and doing it well, it does a lot of things, some well, some not.
 
 It does what various people needed. It wasn't an exercise in writing a
 unixy utility. It was a program that solved real problems for real
 people.

 But the thing everyone loves about Unix and GNU (and certainly the thing
 that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
 paradigm,
 
 I have always hated that. With a passion.

A surprising position from a user of Mutt, whose excellence is due in no
small part to its ability to integrate well with other command-line
utilities (that is, to pipeline). The power and flexibility of pipelines
are extremely well established in the Unix world; I feel no need
whatsoever to waste breath arguing for them, particularly when you
haven't provided the reasons you hate them.

For my part, I'm not exaggerating when I say that it's single-handedly
responsible for why I'm a Unix/GNU user at all, and why I continue to
enjoy developing on it so much.

  find . -name '*.html' -exec sed -i \
    's#http://oldhost/#http://newhost/#g' {} \;

  ( cat message; echo; echo '-- '; cat ~/.signature ) | \
    gpg --clearsign | mail -s 'Report' [EMAIL PROTECTED]

  pic | tbl | eqn | eff-ing | troff -ms

Each one of these demonstrates the enormously powerful technique of
using distinct tools, each with its own feature domain, together to form
a cohesive solution for the need at hand. The best part is that (with
the possible exception of the troff pipeline) each of these components
is immediately available for use in some other pipeline that serves some
completely different purpose.

Note, though, that I don't intend for a Piped-Wget to mean that the user
types in a special pipeline each time he wants to do something with it.
The primary driver would read in some config file that would tell wget
how it should do the piping. You'd just tweak the config file when you
want to add new functionality.
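
To be concrete, here's a rough sketch of what such a config file might
look like. Nothing below exists: the file name, keys, and syntax are all
invented, purely to illustrate declaring which external commands the
driver splices into its pipeline.

  # Hypothetical ~/.wget-ng/pipeline -- invented names and syntax.
  html_filter      = tidy -asxhtml -quiet   # rewrite HTML before link extraction
  link_filter      = grep -v '/logout'      # drop links we never want to follow
  handler.metalink = metalink-fetch         # hand a content type to another tool

The driver would then, in effect, run fetch | html_filter |
link-extractor | link_filter for each page it traverses.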

  - The tools themselves, as much as possible, should be written in an
 easily-hackable scripting language. Python makes a good candidate. Where
 we want efficiency, we can implement modules in C to do the work.
 
 At the time Wget was conceived, that was Tcl's mantra. It failed
 miserably. :-)

Are you claiming that Tcl's failure was due to its ability to integrate
with C, rather than to its abysmal inadequacy as a programming language
(which turned that ability to integrate with C into an absolute
requirement to do so in order to get anything accomplished)?

 How about concentrating on the problems listed in your first paragraph
 (which is why I quoted it)? Could you show us how a bunch of shell
 tools would solve them? Or how a library-ized Wget would solve them? Or
 how any other paradigm or architecture or whatever would solve them?

It should be trivially obvious: you plug them in, rather than wait for
the Wget developers to get around to implementing them.
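
For instance (the wget-ng-* names below are invented for illustration,
not proposed commands), the JavaScript and pre-parse-filtering itches
from that first paragraph might be scratched with something as plain as:

  wget-ng-fetch http://example.com/page.html |
    my-js-expander |          # user-supplied: run the scripts, emit the resulting HTML
    wget-ng-extract-links |   # parse the HTML, print one URL per line
    grep -v '/private/' |     # decide which links actually get followed
    wget-ng-fetch -i -        # fetch whatever survived the filter

The -i - convention just mirrors Wget's existing option for reading URLs
from a file; everything else here is made up.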

The thing that both library-ized Wget and pipeline-ized Wget would offer
is the same: extreme flexibility. It puts the users in control of what
Wget does, rather than just perpetually hearing, "sorry, Wget can't do
it; you could hack the source, though." :p

The difference between the two is that a pipelined Wget offers this
flexibility to a wider range of users, whereas a library Wget offers it
to C programmers.

Or how would you expect to do these things without a library-ized (at
least) Wget? Implementing them in the core app (at least by default) is
clearly wrong (scope bloat). Giving Wget a plugin architecture is good,
but then there's only as much flexibility as there are hooks.
Library-izing Wget is equivalent to providing everything as hooks, and
puts the program using it in the driver's seat (and, naturally, there'd
be a wrapper implementation, like curl for libcurl). A suite of
interconnected utilities does the same, but is accessible to a greater
number of people. Generally at some expense in efficiency (as is any
flexible architecture); but Wget isn't CPU-bound, it's network-bound.

As mentioned in my original post, this would be a separate project from
Wget. Wget would not be going away (though it seems likely to me that it
would quickly reach a primarily 

Wget scriptability

2008-08-02 Thread Micah Cowan

Okay, so there's been a lot of thought in the past, regarding better
extensibility features for Wget. Things like hooks for adding support
for traversal of new Content-Types besides text/html, or adding some
form of JavaScript support, or support for MetaLink. Also, support for
being able to filter results pre- and post-processing by Wget: for
example, being able to do some filtering on the HTML to change how Wget
sees it before parsing for links, but without affecting the actual
downloaded version; or filtering the links themselves to alter what Wget
fetches.

The original concept before I came onboard, was plugin modules. After
some thought, I'd decided I didn't like this overly much, and have
mainly been leading toward the idea of a next-gen Wget-as-a-library
thing, probably wrapping libcurl (and with a command-line client version,
like curl). This obviously wouldn't have been a Wget any more, so would
have been a separate project, with a different name.

However, another thing that's been vaguely itching at me lately is the
fact that Wget's design is not particularly unix-y. Instead of doing one
thing, and doing it well, it does a lot of things, some well, some not.

So the last couple days I've been thinking, maybe wget-ng should be a
suite of interoperating shell utilities, rather than a library or a
single app. This could have some really huge advantages: users could
choose their own HTML parser, plug in parsers for whatever filetypes
they desire, and people who want to implement exotic features could do
just that...
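
As one entirely hypothetical illustration: teaching the suite to
traverse PDFs could amount to nothing more than dropping in any command
that reads a document on stdin and prints the URLs it contains on
stdout, say something like:

  # plug-in link extractor for application/pdf (illustrative only):
  pdftotext - - | grep -Eo 'https?://[^[:space:]]+'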

Of course, at this point we're talking about something that's
fundamentally different from Wget. Just as we were when we were
considering making a next-gen library version. It'd be a completely
separate project. And I'm still not going to start it right away (though
I think some preliminary requirements and design discussions would be a
good idea). Wget's not going to die, nor is everyone going to want to
switch to some new-fangled re-envisioning of it.

But the thing everyone loves about Unix and GNU (and certainly the thing
that drew me to them), is the bunch-of-tools-on-a-crazy-pipeline
paradigm, which is what enables you to mix-and-match different tools to
cover the different areas of functionality. Wget doesn't fit very well
into that scheme, and I think it could become even much more powerful
than it already is, by being broken into smaller, more discrete
projects. Or, to be more precise, to offer an alternative that does the
equivalent.

So far, the following principles have struck me as advisable for a
project such as this:

 - The tools themselves, as much as possible, should be written in an
easily-hackable scripting language. Python makes a good candidate. Where
we want efficiency, we can implement modules in C to do the work.

 - While efficiency won't be the highest priority (else we'd just stick
to the monolith), it's still important. Spawning off separate processes
to each fetch their own page, initiating a new connection each time,
would be a lousy idea. So, the architectural model should center around
a URL-getter driver that manages connections and such, reusing
persistent ones as much as possible. Of course, there might be distinct
commands to handle separate types of URLs (or alternative methods for
handling them, such as MetaLink), and perhaps not all of these would be
able to do persistence (a dead-simple way to add support for scp, etc.,
might be simply to call the command-line program; a rough sketch of such
a wrapper follows below).
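
For illustration, such a wrapper could be as thin as the sketch below;
the argument convention (URL, then output file) is invented here, not a
real interface:

  #!/bin/sh
  # Hypothetical handler the driver would run for scp:// URLs.
  # $1 = scp://host/path/to/file, $2 = local file to write to.
  rest=${1#scp://}          # host/path/to/file
  host=${rest%%/*}          # host
  path=${rest#*/}           # path/to/file
  exec scp "$host:/$path" "$2"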

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/