Re: Release: GNU Wget 1.11.1

Micah Cowan
Tue, 25 Mar 2008 14:09:44 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Todd Pattist wrote:
> Micah Cowan wrote:
>> Announcing the release of version 1.11.1 of GNU Wget.
>> ** Documentation of accept/reject lists in the manual's "Types of
>> Files" section now explains various aspects of their behavior that may
>> be surprising, and notes that they may change in the future.
> I'm glad to see that this made it into the docs - even if this behavior
> is drastically altered in the next rev.

Er, yeah. And, despite my complaining a bit that your version was a bit
longer than I wanted to spend explaining things, my version ended up
longer. :\

> I'm interested in your thoughts on the future of the accept/reject
> filter options.  Currently, accept/reject provides mixed control over
> file retention and link traversal.  Those filters do not apply to html
> files during the first pass through those filters (for traversal), but
> do apply during the second pass for file retention.
> I can see splitting the accept/reject filters into two independent
> filter sets. One set would follow/no-follow  links and the other set
> would keep/delete files after retrieval.  Obviously query string
> matching would be nice in the first set.  OTOH, I can imagine keeping
> accept/reject solely to control file retention and using more advanced
> logic than simple htm/html extension matching to get deeper traversal of
> script sites when permitted by the recursion depth or other controls. 
> What do you see as the best approach?

That pretty much matches my thinking, and in fact also matches previous
discussions on the subject, I believe (except that I can't seem to
_find_ the discussion I'm thinking of).
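
A minimal sketch of the two-filter split discussed above, in Python. This is only an illustration, not Wget code: the class name `Filters` and the parameters `follow_accept`, `follow_reject`, `keep_accept` and `keep_reject` are hypothetical. One set decides whether a link is traversed, the other whether a retrieved file is kept.

```python
from fnmatch import fnmatchcase


def matches(name, accept, reject):
    """True if name passes the reject list and (if given) the accept list."""
    if any(fnmatchcase(name, p) for p in reject):
        return False
    # An empty accept list means "accept everything not rejected".
    return not accept or any(fnmatchcase(name, p) for p in accept)


class Filters:
    """Hypothetical container holding two independent filter sets."""

    def __init__(self, follow_accept=(), follow_reject=(),
                 keep_accept=(), keep_reject=()):
        self.follow_accept = list(follow_accept)
        self.follow_reject = list(follow_reject)
        self.keep_accept = list(keep_accept)
        self.keep_reject = list(keep_reject)

    def should_follow(self, link):
        # Traversal decision: the link text may include a query string.
        return matches(link, self.follow_accept, self.follow_reject)

    def should_keep(self, filename):
        # Retention decision: applied to the file after retrieval.
        return matches(filename, self.keep_accept, self.keep_reject)
```

With `Filters(follow_reject=["*logout*"], keep_accept=["*.pdf"])`, a page such as `index.php?page=2` would be followed but not kept, while `paper.pdf` would be kept.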

> As long as I'm posting, I'll give some very minor feedback on the docs. 
> It would be nice to have a cross reference of the three formats - short
> option, long option and control file or just list all 3 in the first
> discussion of the option.  Section 4 uses that method, but Section 2
> does not.  I often found myself searching for the correct wgetrc startup
> file format after reading up on the option.  As an example, Section 2
> tells you that `-l depth' or `--level=depth' can be used as recursion
> depth options, but you have to do a bit of searching to find out that
> "reclevel=depth" and not "level=depth" is the matching wgetrc command.

*Sigh*, yeah.

> Related to the same issue and for other Windows users who may search the
> archive: as a new user, it's nice to use the long form option, since it
> makes it easier to remember what you're trying to do.  However, a
> command line of 200 chars is hard to read.  I found myself organizing
> all my options into a customized wgetrc file for each site.  In Windows,
> each instance of wget started via a batch file would spawn its own
> local environment, so I could run multiple copies of wget
> simultaneously, each initiated from a separate batch file and each with
> its own customized "set WGETRC=Site1-wgetrc.txt" followed by the basic
> "wget Site1.com" command.

Yes. This sort of thing is part of why I'm interested in seeing a
"--config" option created, which Julien has expressed an interest in
implementing. #inclusion of config files could be useful too, though I'm
planning a probably complete rework of the config file syntax (something
that can support URL-specific configuration), so perhaps such a feature
would be more easily introduced at that time.
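
For reference, the per-site WGETRC workaround described in the quoted text can be sketched in Python roughly as follows. The helper names `build_job` and `start` are hypothetical; the only Wget behavior relied on is that it reads its startup file from the path in the WGETRC environment variable. The site and file names are the ones from the quoted example.

```python
import os
import subprocess


def build_job(site_url, wgetrc_path, base_env=None):
    """Return (command, environment) for one wget run with its own wgetrc."""
    # Copy the environment so each job gets a private WGETRC setting,
    # much like each batch file getting its own local environment.
    env = dict(os.environ if base_env is None else base_env)
    env["WGETRC"] = wgetrc_path
    return ["wget", site_url], env


def start(site_url, wgetrc_path):
    """Launch wget in the background; several jobs can run at once."""
    cmd, env = build_job(site_url, wgetrc_path)
    return subprocess.Popen(cmd, env=env)


if __name__ == "__main__":
    # Assumes wget is on PATH and the wgetrc file exists.
    job = start("Site1.com", "Site1-wgetrc.txt")
    job.wait()
```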

New config syntax might provide a good opportunity to move accept/reject
lists to a different method of accomplishing the same thing. -A and -R
have some inherent limitations (even after you address the
newly-documented issues); you can specify that Wget only accept things
that match FOO _and_ don't match BAR, but you can't specify things that
match FOO _or_ don't match BAR. An expression syntax--or even just a
jump table similar to how PAM or Linux iptables work--could offer a much
more powerful system of determining whether an URL is "acceptable" or not.

Something like:

  if filename matches "*.html":
      accept = yes
      delete-after = yes
  else if filename matches "*.php":
      if parameter "Logout" matches "*y*":
          accept = no
      else:
          accept = yes
  else:
      accept = no
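
To make the intended semantics concrete, here is the same rule set as a small Python sketch. It is only an illustration of the idea, not a proposed implementation: the function name `decide` is hypothetical, and it assumes "filename" means the last path component of the URL and "parameter" means a query-string field.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import parse_qs, urlsplit


def decide(url):
    """Return (accept, delete_after) for a URL, following the rules above."""
    parts = urlsplit(url)
    filename = posixpath.basename(parts.path)
    params = parse_qs(parts.query, keep_blank_values=True)

    if fnmatchcase(filename, "*.html"):
        # Fetch HTML for its links, then discard it.
        return True, True
    if fnmatchcase(filename, "*.php"):
        # Reject any request whose "Logout" parameter contains a "y".
        if any(fnmatchcase(v, "*y*") for v in params.get("Logout", [])):
            return False, False
        return True, False
    # Anything else is not acceptable.
    return False, False
```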

That, obviously, would be somewhere down-the-road. In the meantime, the
less-flexible interface works fine, provided that we remove some of the
unusual "surprises".

The easiest way to do query strings, for this point in time, might be a
flag to toggle inclusion of the query string in matches; people who want
to match query strings for _some_ things but not for others can just
craft their wildcards appropriately (-A '*.php,*.php\?*' to avoid
matching query strings).
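
A rough Python sketch of how such a toggle could behave. The function `accepted` and its `match_query` parameter are hypothetical, not existing Wget options. Note that Python's fnmatch has no backslash escape, so a literal question mark is written `[?]` here instead of `\?`.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def accepted(url, patterns, match_query=False):
    """Match the URL's filename against patterns, optionally with its query."""
    parts = urlsplit(url)
    target = posixpath.basename(parts.path)
    if match_query and parts.query:
        # With the toggle on, patterns see "name?query" instead of "name".
        target += "?" + parts.query
    return any(fnmatchcase(target, p) for p in patterns)
```

With the toggle on, `["*.php"]` alone no longer matches `a.php?x=1`, while `["*.php", "*.php[?]*"]` matches .php files with or without a query string, which is the wildcard crafting described above.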

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6Wn67M8hyUobTrERAgdtAJ9oxXJPezxgcKx68cTxCb1BpzbbqACeKo8c
zR6XSCf/pIrfdSQQaG+04f0=
=HgRE
-----END PGP SIGNATURE-----