With talk of supporting multiple simultaneous connections in a
next-generation version of Wget, various things have been tumbling
around in my mind.

First off: I would not want to do such a thing with threads. Threads
introduce too many problems of their own, including portability and
debuggability concerns. I'd much prefer to do asynchronous I/O.

With the use of asynchronous I/O, a (possibly) better way to do
--timeout presents itself: we can apply the appropriate timeouts in our
calls to select(). The main advantage is that we don't have to muck
around with signals, signal handling, various portability issues, etc.
We can do one --timeout and be done.
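To make that concrete (this is just a sketch of the idea, not code from
Wget; the function and parameter names are made up):

  #include <sys/select.h>
  #include <sys/time.h>

  /* Wait for sock to become readable, for at most timeout_secs seconds.
     Returns > 0 if readable, 0 on timeout, -1 on error.  No SIGALRM,
     no signal handlers, no portability dance.  */
  static int
  wait_readable (int sock, double timeout_secs)
  {
    fd_set rfds;
    struct timeval tv;

    FD_ZERO (&rfds);
    FD_SET (sock, &rfds);

    tv.tv_sec = (long) timeout_secs;
    tv.tv_usec = (long) ((timeout_secs - tv.tv_sec) * 1000000);

    return select (sock + 1, &rfds, NULL, NULL, &tv);
  }

Every place we currently muck around with signal-based timeouts could
instead just pass the deadline down to the select() call.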

The primary downside to this is that operations which can block, but
aren't directly I/O, no longer get timed out. The only one that
currently comes to mind is gethostbyname(), which obviously can block,
but can't be select()ed or set to some sort of non-blocking mode. Also,
even aside from --timeout, having all other traffic sit around and wait
while a name is resolved is not really desirable.

The obvious solution is to use c-ares, which does exactly that: handle
DNS queries asynchronously. Actually, I didn't know this until
just now, but c-ares was split off from ares to meet the needs of the
curl developers. :)
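For the curious, driving c-ares from a select() loop looks roughly like
this (a sketch only, assuming a c-ares recent enough that the host
callback takes a "timeouts" argument; error handling omitted, and the
hostname is just an example):

  #include <ares.h>
  #include <netdb.h>
  #include <stdio.h>
  #include <sys/select.h>
  #include <sys/socket.h>

  static void
  lookup_done (void *arg, int status, int timeouts, struct hostent *host)
  {
    if (status == ARES_SUCCESS)
      printf ("resolved %s\n", host->h_name);
    else
      printf ("lookup failed: %s\n", ares_strerror (status));
  }

  int
  main (void)
  {
    ares_channel channel;

    ares_library_init (ARES_LIB_INIT_ALL);
    ares_init (&channel);

    /* Fire off the query; it completes via the callback above.  */
    ares_gethostbyname (channel, "www.gnu.org", AF_INET, lookup_done, NULL);

    /* In a real program, other transfers would keep moving here; this
       loop just drives the resolver until it's done.  */
    for (;;)
      {
        fd_set readers, writers;
        struct timeval tv, *tvp;
        int nfds;

        FD_ZERO (&readers);
        FD_ZERO (&writers);
        nfds = ares_fds (channel, &readers, &writers);
        if (nfds == 0)
          break;
        tvp = ares_timeout (channel, NULL, &tv);
        select (nfds, &readers, &writers, NULL, tvp);
        ares_process (channel, &readers, &writers);
      }

    ares_destroy (channel);
    ares_library_cleanup ();
    return 0;
  }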

Of course, if we're doing asynchronous net I/O stuff, rather than
reinvent the wheel and try to maintain portability for new stuff, we're
better off using a prepackaged deal, if one exists. Luckily, one does; a
friend of mine (William Ahern) wrote a package called libevnet that
handles all of that; it wraps libevent (by Niels Provos, for handling
async I/O very portably and using the best available interfaces on the
given system) with higher-level socket and buffered I/O facilities,
and, through liblookup, provides a wrapper around c-ares that makes it
convenient to use. If we're going to do async I/O, using libevent and
c-ares, or something very like them, is far too convenient not to do,
and after that decision is made, libevnet becomes a clear win too.
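To give a flavor of the libevent model (sketched against the libevent 2
API; the socket here is a placeholder, and this is not what libevnet's
higher-level API looks like), you register callbacks and per-event
timeouts, and the dispatcher does the rest:

  #include <event2/event.h>
  #include <sys/socket.h>

  static void
  on_readable (evutil_socket_t fd, short what, void *arg)
  {
    if (what & EV_TIMEOUT)
      {
        /* --timeout expired with no data; this is where we'd bail out.  */
        return;
      }
    /* Otherwise, read from fd and feed the HTTP/FTP state machine.  */
  }

  int
  main (void)
  {
    struct event_base *base = event_base_new ();

    /* Placeholder: in real life this would be a connected,
       non-blocking socket for an HTTP or FTP session.  */
    evutil_socket_t sock = socket (AF_INET, SOCK_STREAM, 0);

    struct timeval timeout = { 30, 0 };   /* e.g. --timeout=30 */
    struct event *ev = event_new (base, sock, EV_READ, on_readable, NULL);
    event_add (ev, &timeout);

    /* Runs until no events remain: here, until data arrives or the
       timeout fires, whichever happens first.  */
    event_base_dispatch (base);

    event_free (ev);
    event_base_free (base);
    return 0;
  }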

So, the obvious win is that using libevnet, libevent and c-ares gives
us a "shortest path" to using async I/O, having multiple simultaneous
connections and async DNS queries, and a potentially better way to
manage timeouts.

The obvious loss, and one which I'm positive many of you are already
screaming at me about, is that we just added 3 library dependencies to
Wget in one go. Not freaking cool. Not freaking cool AT ALL.

-= Wget's Strongest Points =-

I absolutely do not want to require a bunch of libraries in order for
people to build Wget. AFAICT, the vast majority of Wget's user base,
which is probably system packagers and distributors, use it for just the
following reasons:

  1. It's pretty small. Its only dependency is OpenSSL, which isn't
even required, but of course in general nobody really doesn't want SSL.
(Ooh looky! Double negatives!)
  2. It's robust. Connection dropped? No prob, try again.
  3. It avoids mucking with preexisting files. Downloading a file named
"foo", but you already _have_ a "foo"? No prob, let's call it "foo.1".

To my mind, these are the core values that have led to so many different
distributions and large software packages relying on Wget. Messing with
any one of these is likely to lose Wget "customers", and in our largest
"target market". (DISCLAIMER: naturally I have nothing whatsoever to
back these claims up. It's conjecture. But it seems pretty credible to me.)

Another major "market" for Wget is the typical command-line "power
user", who uses Wget not only to grab off a quick file, but also to grab
whole sections of sites recursively, and perhaps with occasional quirky
needs like only-visit-these-domains or only-download-these-file-types.
For these people, point #1 above probably holds relatively little
value, its place being taken primarily by Wget's HTML-crawling
functionality. In addition, points that I believe are highly desirable
to such users are:

  - Being able to tell Wget precisely which files to download and which
to skip. The more expressive power we have to accomplish this the
better. Wget already has remarkable flexibility in this area, but there
are many more things that are desirable, and some of the existing
interface is not up to the task of really powerful expression.
  - Being able to parse and "recursively descend" CSS is really, really
important.
  - Being able to do multiple connections, potentially reducing the
total download time (mainly for multi-host sessions), would be a win.
  - Being able to extend Wget, to grok new filetypes for recursive
descent (such as non-HTML XML files, or JavaScript), or extend the power
of expression of "what to grab" even further.

-= The Two Wgets =-

It seems to me, then, that what's really required may in fact be two
different "Wgets".

One that is lightweight but packs a punch: basically Wget as it
currently is. Making it DTRT where it doesn't (such as in its
expectations of FTP servers, or how it handles HTTP authentication) and
adding CSS support would be _really_ helpful. In order to _keep_ it
lightweight, it would be necessary to keep a tight throttle on what new
functionality is accepted; it would be primarily _maintained_, and not
_developed_, though it would of course be kept up-to-date with evolving
definitions of what the World Wide Web is (CSS being an excellent
example). It would support recursive web fetching, but wouldn't bend
over backwards to handle the more exotic needs.

The other Wget would get all the "cool" stuff: pretty much everything
that has been planned for the "next-gen", "2.0" version of Wget. Its
focus would be on users that want it to be their "everything" tool, and
damn the hard-disk-space requirements and library dependencies (not
getting _too_ crazy, of course--that's what the plugin architecture is
for!).

This would certainly allay a growing fear I've had: that a lot of what
people were getting excited about in discussions of "Wget 2.0" just
plain doesn't _feel_ right for Wget, even though these would be
unquestionably useful feature additions. I initially quelled this
concern with the thought
that I could simply sequester the really exotic features to plugins,
allowing people the freedom to choose what they want their Wget to be.

But asynchronous I/O can't be simply partitioned away like that: it
requires intrinsic and pervasive changes to Wget's architecture. While
the two Wgets could share some logic for recursion, file naming,
timestamping, etc, the actual I/O wouldn't easily be sharable. Well:
code written for an async I/O platform can easily just be used
synchronously, but async code comes at a significant cost to legibility
and flexibility, a cost that IMO wouldn't be worth paying in the
"synchronous" Wget.

Plus, there is the following thought. While I've talked about not
reinventing the wheel, using existing packages to save us the trouble of
having to maintain portable async code, higher-level buffered-IO and
network comm code, etc, I've been neglecting one more package choice.
There is, after all, already a Free Software package that goes beyond
handling asynchronous network operations, to specifically handle
asynchronous _web_ operations; I'm speaking, of course, of libcurl.
There would seem to be some obvious motivation for simply using libcurl
to handle all asynchronous web traffic, and wrapping it with the logic
we need to handle retries, recursion, timestamping, traversing,
selecting which files to download, etc. Besides async web code, of
course, we'd also automatically get support for a number of additional
protocols (SFTP, for example) that have been requested in Wget.
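Just as a taste of what that might look like (a rough sketch using
libcurl's multi interface, assuming a libcurl recent enough to have
curl_multi_wait(); the URLs are placeholders, and the downloaded data
simply goes to stdout by default):

  #include <curl/curl.h>

  int
  main (void)
  {
    const char *urls[] = { "http://example.com/a", "http://example.com/b" };
    CURL *easy[2];
    CURLM *multi;
    int still_running, i;

    curl_global_init (CURL_GLOBAL_ALL);
    multi = curl_multi_init ();

    for (i = 0; i < 2; i++)
      {
        easy[i] = curl_easy_init ();
        curl_easy_setopt (easy[i], CURLOPT_URL, urls[i]);
        curl_multi_add_handle (multi, easy[i]);
      }

    /* Both transfers progress concurrently, no threads involved.  */
    do
      {
        curl_multi_perform (multi, &still_running);
        if (still_running)
          curl_multi_wait (multi, NULL, 0, 1000, NULL);
      }
    while (still_running);

    for (i = 0; i < 2; i++)
      {
        curl_multi_remove_handle (multi, easy[i]);
        curl_easy_cleanup (easy[i]);
      }
    curl_multi_cleanup (multi);
    curl_global_cleanup ();
    return 0;
  }

Retry logic, recursion, timestamping, file naming and the rest would
then live entirely in our wrapper code.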

PLEASE NOTE: these are ramblings. They are ideas. They are what's
currently rattling around in my brain. Note, too, that there are a
couple of leaps in the given logic for having a completely separate
"Wget 2.0", the biggest one probably being that multiple connections do
not automatically imply asynchronous I/O; that's just my preference.

Not going for async I/O destroys the depends-on-myriad-libraries
argument, along with the whole Wget-2.0-needs-to-be-separate and
maybe-we-should-use-libcurl arguments. OTOH, the other options I can
think of--using threads, or using multiple processes--have their own
strong downsides, especially in terms of portability and maintenance cost.

I expect this to be controversial thinking, and am hereby officially
begging for feedback, and for alternative thoughts and viewpoints.

And, of course, when I say "there would be two Wgets", what I really
mean is that the more exotic-featured one would be something else
entirely, not a Wget, and would have a separate name.

--
Micah J. Cowan