Micah Cowan wrote:
Well, -E is special, true. But in general the second quote is (by
definition) correct.

-E, obviously, _shouldn't_ be special...

I hope it's clear I'm not complaining. Wget is great and your efforts are very much appreciated. I just wanted to document the behavior I was seeing in a way that would help others. I actually like the current behavior, now that I (more or less) understand it. I can add php to the accept list, which controls traversing, and also optionally add html if I want to keep the html files. If file retention were determined solely by the URL, then traversal and local file retention would be inextricably linked.

I haven't yet quite figured out file extension matching versus string
matching in filenames, but extensions seem to match regardless of
leading characters or following ?id=1 parameters.

That's right; the "query" portion of the URL is not used to determine
matching. There are, of course, times when you specifically wish to tell
wget not to follow certain specific query strings (such as edit or print
or... in wikis); wget doesn't currently support this (I plan to fix this).

Now I'm confused again. I suppose I can go through more trial and error or dig through the source to figure out what it's really doing, but in hopes you can throw more light on this, I'll spell out what is confusing me. (Comments relate to wget 1.11 running on Windows XP.)

Confusion 1: Right now, I'm only using file extensions in the accept= parameters, such as accept=zip,jpg,gif,php etc. Even if the query portion (the "?id=1" part of site.com/index.php?id=1) is not considered during matching, it's not clear to me why accept=php matches "site.com/index.php". Why don't I need *.php (Windows) or *php (assuming the * glob matches the period)? Would "accept=index" match "index.php?id=1"? How about "accept=*index*"? I assumed I could do an accept match on the query portion, the filename portion, or even the domain, but I suspect now that's wrong. The domain gets stripped off when the local name is constructed, so I realize now I can't match on that (the local filename is used for matching), but the query portion is usually left as part of the filename, with an at sign replacing the question mark. Is filename matching allowed, or only extension matching?
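
For reference, the wget manual describes accept/reject list elements as suffixes unless they contain one of the wildcard characters *, ?, [ or ], in which case they are treated as patterns. Below is a minimal Python sketch of that rule only. It is not wget's code: the function name element_matches is made up for illustration, and which string wget actually hands to the matcher (with or without the query, before or after local renaming) is exactly the open question here, so the caller supplies the name.

```python
from fnmatch import fnmatchcase

# Characters that, per the wget manual, turn a list element into a pattern.
WILDCARDS = set("*?[]")

def element_matches(element: str, name: str) -> bool:
    """Hypothetical helper: does one accept/reject element match `name`?"""
    if any(ch in WILDCARDS for ch in element):
        # Wildcard present: glob-style match against the whole name.
        return fnmatchcase(name, element)
    # No wildcard: plain suffix comparison, so "php" needs no "*." in front.
    return name.endswith(element)

# What the sketch predicts for the cases asked about above
# (assumption: the name being tested has had its query removed).
print(element_matches("php", "index.php"))      # True
print(element_matches("index", "index.php"))    # False
print(element_matches("*index*", "index.php"))  # True
```

Under this reading, accept=php behaves like a suffix test, which would explain why no leading glob is needed, while accept=index would only match names that end in "index".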

Confusion 2: I'm rejecting based on the query string, usually after an accept string allowing defined extensions. I think I understand this, and I think it's working fine. I'm usually doing something like reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout links or thread subscription links in a phpbb setting. This works. I think it's doing exactly what you say it's not yet capable of doing, but maybe I'm missing something. Does the accept matching work differently from the reject matching? Does reject work on the URL before retrieval, but accept work on the local filename after retrieval? If the site.com/index.php?mode=logout link were being traversed with accept=php and reject=*logout*, I would be getting logged out, but I'm not.

Hmmmmm..... light perhaps begins to dawn. It looks like both accept and reject are applied twice - once before retrieval and once after. To be retrieved/traversed it has to pass both filters and then after local renaming, it has to pass both again. That would fit what I'm seeing. My reject filter prevents traversing logout links during the first pass and my accept filter deletes php files during the second check after html renaming.
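
The two-pass guess above can be written down as a small Python model. This is only a sketch of the hypothesis, not of wget's implementation: the names matches, acceptable and simulate are invented for illustration, and the pre-fetch and post-fetch names are passed in by the caller because it is not established here which form of the name each pass sees.

```python
from fnmatch import fnmatchcase

def matches(element: str, name: str) -> bool:
    # Assumed rule: wildcard elements are globs, anything else is a suffix.
    if any(ch in "*?[]" for ch in element):
        return fnmatchcase(name, element)
    return name.endswith(element)

def acceptable(name: str, accept: list, reject: list) -> bool:
    # A name passes if it matches some accept element (when an accept list
    # is given) and matches no reject element.
    if accept and not any(matches(e, name) for e in accept):
        return False
    return not any(matches(e, name) for e in reject)

def simulate(url_name: str, local_name: str, accept: list, reject: list) -> str:
    """Model of the guessed behavior: filter before and after retrieval."""
    if not acceptable(url_name, accept, reject):
        return "skipped"   # never retrieved or traversed
    if not acceptable(local_name, accept, reject):
        return "deleted"   # retrieved and traversed, then removed locally
    return "kept"

# Local names below assume html renaming (-E) on Windows, where "?" becomes "@".
print(simulate("index.php", "index.php@id=1.html", ["php"], ["*logout*"]))          # deleted
print(simulate("index.php", "index.php@id=1.html", ["php", "html"], ["*logout*"]))  # kept
print(simulate("logout.php", "logout.php.html", ["php", "html"], ["*logout*"]))     # skipped
```

In this model, adding html to the accept list is what lets the renamed file survive the second check, which fits the behavior described at the top of this message.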

Thanks for any comments or clarifications.
