Micah Cowan
Tue, 25 Mar 2008 14:09:44 -0700
Todd Pattist wrote:
> Micah Cowan wrote:
>> Announcing the release of version 1.11.1 of GNU Wget.
>> ** Documentation of accept/reject lists in the manual's "Types of
>> Files" section now explains various aspects of their behavior that may
>> be surprising, and notes that they may change in the future.
> I'm glad to see that this made it into the docs - even if this behavior
> is drastically altered in the next rev.
Er, yeah. And, despite my complaining a bit that your version was a bit
longer than I wanted to spend explaining things, my version ended up
longer. :\
> I'm interested in your thoughts on the future of the accept/reject
> filter options. Currently, accept/reject provides mixed control over
> file retention and link traversal. Those filters do not apply to html
> files during the first pass through those filters (for traversal), but
> do apply during the second pass for file retention.
> I can see splitting the accept/reject filters into two independent
> filter sets. One set would follow/no-follow links and the other set
> would keep/delete files after retrieval. Obviously query string
> matching would be nice in the first set. OTOH, I can imagine keeping
> accept/reject solely to control file retention and using more advanced
> logic than simple htm/html extension matching to get deeper traversal of
> script sites when permitted by the recursion depth or other controls.
> What do you see as the best approach?
That pretty much matches my thinking, and in fact also matches previous
discussions on the subject, I believe (except that I can't seem to
_find_ the discussion I'm thinking of).
> As long as I'm posting, I'll give some very minor feedback on the docs.
> It would be nice to have a cross reference of the three formats - short
> option, long option and control file or just list all 3 in the first
> discussion of the option. Section 4 uses that method, but Section 2
> does not. I often found myself searching for the correct wgetrc startup
> file format after reading up on an option. As an example, Section 2
> tells you that `-l depth' or `--level=depth' can be used as recursion
> depth options, but you have to do a bit of searching to find out that
> "reclevel=depth" and not "level=depth" is the matching wgetrc command.
*Sigh*, yeah.
> Related to the same issue and for other Windows users who may search the
> archive: as a new user, it's nice to use the long form option, since it
> makes it easier to remember what you're trying to do. However, a
> command line of 200 chars is hard to read. I found myself organizing
> all my options into a customized wgetrc file for each site. In Windows,
> each instance of wget started via a batch file would spawn its own
> local environment, so I could run multiple copies of wget
> simultaneously, each initiated from a separate batch file and each with
> its own customized "set WGETRC=Site1-wgetrc.txt" followed by the basic
> "wget Site1.com" command.
Yes. This sort of thing is part of why I'm interested in seeing a
"--config" option created, which Julien has expressed an interest in
implementing. #inclusion of config files could be useful too, though I'm
planning a probably complete rework of the config file syntax (something
that can support URL-specific configuration), so perhaps such a feature
would be more easily introduced at that time.
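Until something like "--config" exists, the per-site WGETRC arrangement
Todd describes can be scripted. What follows is only a rough sketch in
Python, not anything Wget ships: the helper names build_invocation and
launch_all are made up for illustration, and the site/wgetrc pairs are
whatever you pass in. The one thing it relies on from Wget itself is the
WGETRC environment variable.

```python
import os
import subprocess


def build_invocation(url, wgetrc_path, base_env=None):
    """Return (argv, env) for one wget run that reads its own wgetrc.

    Hypothetical helper: it copies the environment and points WGETRC
    at the per-site file, so concurrent runs do not share settings.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["WGETRC"] = wgetrc_path
    return ["wget", url], env


def launch_all(sites):
    """Start one wget per (url, wgetrc_path) pair, then wait for all.

    Returns the list of exit codes, in the same order as `sites`.
    """
    procs = []
    for url, wgetrc_path in sites:
        argv, env = build_invocation(url, wgetrc_path)
        procs.append(subprocess.Popen(argv, env=env))
    return [p.wait() for p in procs]


# Example call (assumes wget is on PATH and the wgetrc file exists):
# launch_all([("Site1.com", "Site1-wgetrc.txt")])
```

Each child process gets its own copy of the environment, which is the
same effect as the separate batch files with "set WGETRC=...".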
New config syntax might provide a good opportunity to move accept/reject
lists to a different method of accomplishing the same thing. -A and -R
have some inherent limitations (even after you address the
newly-documented issues); you can specify that Wget only accept things
that match FOO _and_ don't match BAR, but you can't specify things that
match FOO _or_ don't match BAR. An expression syntax--or even just a
jump table similar to how PAM or Linux iptables work--could offer a much
more powerful system of determining whether a URL is "acceptable" or not.
Something like:
  if filename matches "*.html":
      accept = yes
      delete-after = yes
  else if filename matches "*.php":
      if parameter "Logout" matches "*y*":
          accept = no
      else:
          accept = yes
  else:
      accept = no
That, obviously, would be somewhere down the road. In the meantime, the
less-flexible interface works fine, provided that we remove some of the
unusual "surprises".
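To make the idea concrete, here is how the hypothetical rule set above
might evaluate a URL, written as a small Python sketch. None of this is
Wget code or a proposed implementation: the function name decide and the
returned dictionary are invented for the example, and I'm assuming that
"filename" means the last path component and "parameter" means a
query-string field.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import parse_qs, urlsplit


def decide(url):
    """Apply the example rules to one URL (illustrative only).

    Returns a dict with 'accept' and 'delete_after' booleans.
    """
    parts = urlsplit(url)
    filename = basename(parts.path)   # last path component, no query
    params = parse_qs(parts.query)    # e.g. {"Logout": ["yes"]}

    if fnmatchcase(filename, "*.html"):
        # Fetch HTML so its links can be followed, then discard it.
        return {"accept": True, "delete_after": True}
    if fnmatchcase(filename, "*.php"):
        # Refuse any .php URL whose Logout parameter contains "y".
        logout_values = params.get("Logout", [])
        if any(fnmatchcase(v, "*y*") for v in logout_values):
            return {"accept": False, "delete_after": False}
        return {"accept": True, "delete_after": False}
    # Everything else is rejected.
    return {"accept": False, "delete_after": False}
```

The point is that each branch can set more than one outcome and can
nest conditions, which a flat pair of -A and -R lists cannot express.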
The easiest way to do query strings, for this point in time, might be a
flag to toggle inclusion of the query string in matches; people who want
to match query strings for _some_ things but not for others can just
craft their wildcards appropriately (-A '*.php,*.php\?*' accepts .php
URLs whether or not a query string follows).
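A quick Python sketch of what such a toggle could look like, purely as
an illustration: the function accepted and its match_query argument are
hypothetical, not an existing Wget option. Note also that Python's
fnmatch has no backslash escape, so the literal question mark that Wget
would spell '\?' is written '[?]' here.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit


def accepted(url, patterns, match_query=False):
    """Return True if the URL matches any accept pattern.

    With match_query=False only the file name is compared; with
    match_query=True the query string (if any) is appended as
    "?query" before matching, so patterns must allow for it.
    """
    parts = urlsplit(url)
    name = basename(parts.path)
    if match_query and parts.query:
        name = name + "?" + parts.query
    return any(fnmatchcase(name, pattern) for pattern in patterns)


# With the toggle on, a bare "*.php" no longer matches "a.php?x=1",
# but adding "*.php[?]*" to the list restores the match.
```

So someone who turns the flag on keeps the old behaviour for a given
extension simply by listing both forms of the pattern.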
--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/