Micah Cowan wrote:
Well, -E is special, true. But in general the second quote is (by
definition) correct.
-E, obviously, _shouldn't_ be special...
I hope it's clear I'm not complaining. Wget is great and your efforts
are very much appreciated. I just wanted to document the behavior I was
seeing in a way that would help others. I actually like the current
behavior - now that I (more or less) understand it. I can add php to the
accept list, which controls traversing, and also optionally add html if
I want to keep the html files. If file retention were determined based
solely on the URL, then traversal and local file retention would be
inextricably linked.
I haven't yet quite figured out file extension matching versus string
matching in filenames, but extensions seem to match regardless of
leading characters or following ?id=1 parameters.
That's right; the "query" portion of the URL is not used to determine
matching. There are, of course, times when you specifically wish to tell
wget not to follow certain specific query strings (such as edit or print
or... in wikis); wget doesn't currently support this (I plan to fix this).
Now I'm confused again. I suppose I can go through more trial and error
or dig through the source to figure out what it's really doing, but in
hopes you can throw more light on this, I'll explicate what is confusing
me. (comments relate to wget 1.11 running on Windows XP)
Confusion 1: Right now, I'm only using file extensions in the accept=
parameters, such as accept=zip,jpg,gif,php etc. Even if the query
portion (the "?id=1" part of site.com/index.php?id=1) is not considered
during matching, it's not clear to me why accept=php matches
"site.com/index.php". Why don't I need *.php (Windows) or *php
(assuming the * glob matches the period)? Would "accept=index" match
"index.php?id=1"? How about "accept=*index*"? I assumed I could do an
accept match on the query portion, the filename portion, or even the
domain, but I suspect now that's wrong. The domain gets stripped off
when the local name is constructed, so I realize now I can't match on
that (local filename used for matching), but the query portion is
usually left as part of the filename, with an at sign replacing the
question mark. Is filename matching allowed or only extension matching?
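For reference, the wget manual describes each accept/reject entry as
either a suffix or a pattern: an entry containing any of the wildcard
characters *, ?, [ or ] is treated as a pattern, and anything else is
treated as a suffix. Below is a minimal Python sketch of that rule only.
The helper name entry_matches and the sample filenames are made up for
illustration, case-sensitive matching is an assumption, and the sketch
says nothing about which string (URL file part or local filename) wget
actually passes to the matcher.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")

def entry_matches(entry: str, filename: str) -> bool:
    """Hypothetical helper modelling the documented suffix-or-pattern rule."""
    if any(ch in WILDCARDS for ch in entry):
        # Entry contains a wildcard: treat it as a shell-style pattern.
        return fnmatchcase(filename, entry)
    # No wildcard: treat the entry as a plain suffix of the filename.
    return filename.endswith(entry)

print(entry_matches("php", "index.php"))           # True (suffix match)
print(entry_matches("index", "index.php"))         # False (not a suffix)
print(entry_matches("*index*", "index.php@id=1"))  # True (pattern match)
```

Under this reading, accept=php needs no leading * because a bare entry
is compared against the end of the name, while accept=index would only
match names that end in "index".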
Confusion 2: I'm rejecting based on the query string, usually after an
accept string allowing defined extensions. I think I understand this,
and I think it's working fine. I'm usually doing something like
reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout
links or thread subscription links in a phpbb setting. This works. I
think it's doing exactly what you say it's not yet capable of doing, but
maybe I'm missing something. Does the accept matching work differently
from the reject matching? Does reject work on the URL before retrieval,
but accept work on the local filename after retrieval? If the
site.com/index.php?mode=logout link was being traversed with
accept=php and reject=*logout*, I would be getting logged out, but I'm not.
Hmmmmm..... light perhaps begins to dawn. It looks like both accept and
reject are applied twice - once before retrieval and once after. To be
retrieved/traversed it has to pass both filters and then after local
renaming, it has to pass both again. That would fit what I'm seeing.
My reject filter prevents traversing logout links during the first pass
and my accept filter deletes php files during the second check after
html renaming.
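The two-pass guess above can be written out as a small Python model. It
is only a sketch of the hypothesis, not of wget's source: the function
names are invented, the suffix-or-pattern rule is taken from the manual's
description of accept/reject lists, and the caller supplies both the
pre-retrieval name and the local name, since which string wget really
checks in each pass is exactly the open question. The local name
index.php.html assumes -E appends .html to index.php.

```python
from fnmatch import fnmatchcase

def entry_matches(entry, name):
    # Wildcard entries are patterns; bare entries are suffixes.
    if any(ch in "*?[]" for ch in entry):
        return fnmatchcase(name, entry)
    return name.endswith(entry)

def passes(name, accept, reject):
    # A non-empty accept list must match; no reject entry may match.
    if accept and not any(entry_matches(e, name) for e in accept):
        return False
    return not any(entry_matches(e, name) for e in reject)

def two_pass(pre_name, local_name, accept, reject):
    """Hypothesis: filters run once before retrieval, once after renaming."""
    if not passes(pre_name, accept, reject):
        return "skipped"
    if not passes(local_name, accept, reject):
        return "retrieved, then deleted"
    return "retrieved and kept"

reject = ["*logout*"]
print(two_pass("index.php", "index.php.html", ["php"], reject))
# retrieved, then deleted
print(two_pass("index.php", "index.php.html", ["php", "html"], reject))
# retrieved and kept
print(two_pass("logo.gif", "logo.gif", ["php"], reject))
# skipped
```

If the model is right, it would explain why adding html to the accept
list keeps the renamed files while accept=php alone still lets the
pages be traversed.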
Thanks for any comments or clarifications.