Micah Cowan wrote:
Announcing the release of version 1.11.1 of GNU Wget.
** Documentation of accept/reject lists in the manual's "Types of
Files" section now explains various aspects of their behavior that may
be surprising, and notes that they may change in the future.
I'm glad to see that this made it into the docs - even if this behavior
is drastically altered in the next rev.
I'm interested in your thoughts on the future of the accept/reject
filter options. Currently, accept/reject provides mixed control over
file retention and link traversal. The filters do not apply to HTML
files during the first pass (which governs link traversal), but they
do apply during the second pass (which governs file retention).
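To make that two-pass behavior concrete, here is a minimal sketch (the URL and depth are placeholders):

```shell
# Recursive retrieval, accepting only PDFs.
# Pass 1 (traversal): .html pages are downloaded anyway so that
# their links can be followed, even though "pdf" does not match them.
# Pass 2 (retention): the accept list is applied again, and the
# downloaded HTML files are deleted.
wget -r -l 2 -A pdf http://example.com/docs/
```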
I can see splitting the accept/reject filters into two independent
filter sets. One set would follow/no-follow links and the other set
would keep/delete files after retrieval. Obviously query string
matching would be nice in the first set. OTOH, I can imagine keeping
accept/reject solely to control file retention and using more advanced
logic than simple htm/html extension matching to get deeper traversal of
script sites when permitted by the recursion depth or other controls.
What do you see as the best approach?
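For illustration only, the first approach might look something like this (these option names are purely hypothetical; none of them exist in wget today):

```shell
# Hypothetical split: traversal filters decide which links to follow
# (including query-string matching), while retention filters decide
# which retrieved files to keep.
wget -r --follow-accept='*.html,*.php?page=*' \
        --keep-accept='*.pdf' http://example.com/
```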
As long as I'm posting, I'll give some very minor feedback on the docs.
It would be nice to have a cross-reference of the three formats (short
option, long option, and wgetrc command), or simply to list all three
in the first discussion of each option. Section 4 uses that method, but
Section 2 does not. I often found myself searching for the correct
wgetrc startup-file format after reading up on an option. As an
example, Section 2
tells you that `-l depth' or `--level=depth' can be used as recursion
depth options, but you have to do a bit of searching to find out that
"reclevel=depth" and not "level=depth" is the matching wgetrc command.
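In other words, the same setting currently has to be tracked down in three places; listed together, the recursion-depth forms would read:

```shell
wget -r -l 5 http://example.com/        # short option
wget -r --level=5 http://example.com/   # long option
# and in a wgetrc startup file (note "reclevel", not "level"):
#   reclevel = 5
```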
Related to the same issue and for other Windows users who may search the
archive: as a new user, it's nice to use the long form option, since it
makes it easier to remember what you're trying to do. However, a
command line of 200 chars is hard to read. I found myself organizing
all my options into a customized wgetrc file for each site. In Windows,
each instance of wget started via a batch file would spawn its own
local environment, so I could run multiple copies of wget
simultaneously, each initiated from a separate batch file and each with
its own customized "set WGETRC=Site1-wgetrc.txt" followed by the basic
"wget Site1.com" command.
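A per-site wrapper along those lines might look like this (Site1-wgetrc.txt and Site1.com are the placeholder names from the example above):

```bat
:: Site1.bat -- launches wget with a site-specific startup file.
:: "set" affects only this script's copy of the environment, so
:: several such scripts can run wget concurrently, each pointing
:: at its own wgetrc.
set WGETRC=Site1-wgetrc.txt
wget Site1.com
```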
** Documentation of --no-parents now explains how a trailing slash, or
lack thereof, in the specified URL, will affect behavior.
--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/