-----BEGIN PGP SIGNED MESSAGE-----
Todd Pattist wrote:
> Micah Cowan wrote:
>> Well, -E is special, true. But in general the second quote is (by
>> definition) correct.
-E, obviously, _shouldn't_ be special...
> I hope it's clear I'm not complaining.
I didn't take it as complaining.
>>> I haven't yet quite figured out file extension matching versus string
>>> matching in filenames, but extensions seem to match regardless of
>>> leading characters or following ?id=1 parameters.
>> That's right; the "query" portion of the URL is not used to determine
>> matching. There are, of course, times when you specifically wish to tell
>> wget not to follow certain specific query strings (such as edit or print
>> or... in wikis); wget doesn't currently support this (I plan to fix
> Now I'm confused again. I suppose I can go through more trial and error
> or dig through the source to figure out what it's really doing, but in
> hopes you can throw more light on this, I'll explicate what is confusing
> me. (comments relate to wget 1.11 running on Windows XP)
> Confusion 1: Right now, I'm only using file extensions in the accept=
> parameters, such as accept=zip,jpg,gif,php etc. Even if the query
> portion (the "?id=1" part of site.com/index.php?id=1) is not considered
> during matching, it's not clear to me why accept=php matches
> "site.com/index.php". Why don't I need *.php (Windows) or *php
> (assuming the *glob matches the period). Would "accept=index" match
> "index.php?id=1"? How about "accept=*index*"
(This is in the documentation; at least the full documentation. See the
manual on the website; I think the Windows Help files that ship with
Wget are based on a "short" version of the manual).
The way the matching works is that, if there are any wildcard characters
(any of '*', '?', '[' or ']'), then it is a wildcard pattern; otherwise,
it's matched exactly against the filename suffix (not necessarily
extension). "php" will match "index.php", or even "shophp", but not
"index.php.foo". "*.php" wouldn't match "shophp", since the period is
This is only ever matched against the filename, and never the domain,
directory, or query string (actually, as you've discovered, it's matched
against the _local_ filename for some cases, which needs to be fixed).
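
To make the rule above concrete, here is a rough sketch in Python. It is not Wget's code: the helper name matches() is made up for illustration, and I'm assuming the comparison is case-sensitive. It only mirrors the description: a pattern containing any of '*', '?', '[' or ']' is treated as a wildcard pattern against the whole filename; anything else is compared against the end of the filename.

```python
from fnmatch import fnmatchcase

# Characters that turn an accept/reject entry into a wildcard pattern.
WILDCARD_CHARS = "*?[]"


def matches(pattern: str, filename: str) -> bool:
    """Hypothetical helper mirroring the rule described in this message."""
    if any(ch in pattern for ch in WILDCARD_CHARS):
        # Wildcard pattern: it has to match the whole filename.
        return fnmatchcase(filename, pattern)
    # No wildcards: plain suffix comparison (assumed case-sensitive).
    return filename.endswith(pattern)


if __name__ == "__main__":
    for pattern, filename in [
        ("php", "index.php"),
        ("php", "shophp"),
        ("php", "index.php.foo"),
        ("*.php", "shophp"),
        ("index", "index.php"),
        ("*index*", "index.php"),
    ]:
        print(f"{pattern!r} vs {filename!r}: {matches(pattern, filename)}")
```

Under these assumptions, "index" alone would not match "index.php" (it is not a suffix), while "*index*" would.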
As I currently understand it from the code, at least for Wget 1.11,
matching is against the _URL_'s filename portion (and only that portion:
no query strings, no directories) when deciding whether it should
download something through a recursive descent (the relevant spot in the
code is in recur.c, marked by a comment starting "6. Check for
When deciding whether it should delete a file afterwards, however, it
uses the _local_ filename (relevant code also in recur.c, near "Either
--delete-after was specified,"). I'm not positive, but this probably
means query strings _do_ matter in that case. :p
Confused? Coz I sure am!
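
A small Python sketch of the two cases may help. It is only an illustration of my reading above, not Wget's code: url_filename() and local_filename() are made-up helpers, and the '@' separator in the local name is an assumption based on what you describe seeing on Windows.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit


def url_filename(url: str) -> str:
    """Filename portion of the URL path: no directory, no query string."""
    return basename(urlsplit(url).path)


def local_filename(url: str, query_sep: str = "@") -> str:
    """Assumed local name: the URL filename plus the query, joined by '@'."""
    parts = urlsplit(url)
    name = basename(parts.path)
    return name + query_sep + parts.query if parts.query else name


if __name__ == "__main__":
    url = "http://site.com/index.php?mode=logout"
    reject = "*logout*"

    # Download decision: the pattern only sees the URL's filename portion.
    print(url_filename(url), fnmatchcase(url_filename(url), reject))

    # Delete-afterwards decision: the pattern sees the local filename.
    print(local_filename(url), fnmatchcase(local_filename(url), reject))
```

If that reading is right, a pattern like *logout* cannot match before the download, because the query string is not part of what it is compared against, but it can match the saved file's name afterwards.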
> I assumed I could do an
> accept match on the query portion, the filename portion, or even the
> domain, but I suspect now that's wrong. The domain gets stripped off
> when the local name is constructed, so I realize now I can't match on
> that (local filename used for matching), but the query portion is
> usually left as part of the filename, with an atsign replacing the
> question mark. Is filename matching allowed or only extension matching?
Well.... there's a _separate_ option for matching/rejecting domain names
(which requires -H to be meaningful, since by default Wget only allows
hosts you've explicitly requested, plus any that result from redirections).
> Confusion 2: I'm rejecting based on the query string, usually after an
> accept string allowing defined extensions. I think I understand this,
> and I think it's working fine. I'm usually doing something like
> reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout
> links or thread subscription links in a phpbb setting. This works. I
> think it's doing exactly what you say it's not yet capable of doing, but
> maybe I'm missing something. Does the accept matching work differently
> from the reject matching?
They use _exactly_ the same code.
> Does reject work on the URL before retrieval,
> but accept work on the local filename after retrieval? If the
> site.com/index.php?mode=logout link was being traversed with
> accept=php and reject=*logout*, I would be getting logged out, but I'm not.
What site is it? You might run wget with --debug to find out _exactly_
why it doesn't traverse these (see
for an enumeration of various messages Wget uses to say why something
isn't downloaded). Some sites are intelligent enough to include a
"rel=nofollow" or "nofollow" attribute in their anchor tags, which Wget
will obey unless -e robots=off was specified. The MoinMoin wiki
software, for instance, will do this (which is what the Wget Wiki runs on).
> Hmmmmm..... light perhaps begins to dawn. It looks like both accept and
> reject are applied twice - once before retrieval and once after.
> To be retrieved/traversed it has to pass both filters and then after local
> renaming, it has to pass both again. That would fit what I'm seeing. My
> reject filter prevents traversing logout links during the first pass and
> my accept filter deletes php files during the second check after html
I think it's probably not preventing the traversal, but that traversal
is being prevented by other means.
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
-----END PGP SIGNATURE-----