Re: Accept and Reject - particularly for PHP and CGI sites
> When deciding whether it should delete a file afterwards, however, it uses the _local_ filename (the relevant code is also in recur.c, near "Either --delete-after was specified,"). I'm not positive, but this probably means query strings _do_ matter in that case. :p Confused? Coz I sure am! I had thought there was already an issue filed against this, but upon searching discovered I was thinking of a couple of related bugs that had been closed. I've filed a new issue for this: https://savannah.gnu.org/bugs/?22670

I'm not sure whether this post should go into the bug-list discussion or here, but I'll put it here. I have to say, I'm not sure this is properly classed as a bug. If accept/reject applies to the original URL filename, why should the code bother to apply it again to the local name? If the filters don't pass the URL filename and wget doesn't retrieve the file, it can't save it. I assume the answer was to handle script and Content-Disposition cases, where you don't know what you're going to get back.

If you match only on the URL, you have no way to control traversal separately from file retention, and that's something you definitely want. (It's the default for conventional HTML-based sites.) To put it another way: I usually want to download all the php files and traverse all that turn out to be html, but I may only want to keep the zips or jpgs. With two checks, one before download on the URL filename and another after download on the local filename, I've got some control on CGI/PHP script-based sites that is similar to the control on a conventional html-page site.

If this behavior is changed, then you'd probably need two sets of accept/reject filters that could be defined separately: one set to control traversal, and one to control file retention. I'd actually prefer that, particularly with matching extended to the query-string portion of the URL. Right now, it may be impossible to prevent traversing some links.
If you don't want to traverse "index.php?mode=logout", but do want to get "index.php?mode=getfile", there's no way to do it, since the URL filename is the same in both cases.

In the short term, it would help to add something to the documentation in the accept/reject area, such as the following:

  The accept/reject filters are applied to the filename twice: once to the filename in the URL before downloading, to determine whether the file should be retrieved (and parsed for more links, if it is determined after download to be an html file), and again to the local filename after it is retrieved, to determine whether it should be kept. The local filename after retrieval may be significantly different from the URL filename before retrieval, for several reasons:

  1) The URL filename does not include any query-string portion of the URL, such as the string "?topic=16" in the URL "http://site.com/index.php?topic=16". After download, the file may be stored under the local filename "[EMAIL PROTECTED]". Accept/reject matching does not apply to the query-string portion of the URL before download, but it will apply after download, when the query string is incorporated into the local filename.

  2) When content disposition is on, the local filename may be completely different from the URL filename. The URL "index.php?getfile=21" may return a Content-Disposition header producing a local file named "some_interesting_file.zip".

  3) The -E (html extension) switch, and sometimes the -nd (no directories) switch, will alter the filename suffix by adding ".html", or ".1" for duplicate files.

  If the URL filenames in links found when the starting page is parsed do not pass the accept/reject filters, those links will not be followed and will not be parsed for more links, unless the filename ends in .html or .htm.
If accept/reject filters are used on CGI, PHP, ASP, and similar script-based sites, the URL filename must pass the filters (without considering any query-string portion) if the links are to be traversed and parsed, and the local filename must pass the filters if the retrieved files are to be retained.
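The query-string behavior in point 1) above can be sketched in shell. This is a simulation of the behavior as described in this thread, not wget's actual code; the URL is an invented example, and the at-sign renaming follows the description given elsewhere in the thread.

```shell
# Pre-download check: only the URL's filename portion is matched,
# so the query string "?topic=16" is invisible to the filters.
url="http://site.com/index.php?topic=16"
path="${url#*://*/}"           # -> "index.php?topic=16"
url_filename="${path%%\?*}"    # -> "index.php" (query string dropped)

# Post-download check: the local name keeps the query string, with an
# at-sign replacing the '?' (the Windows renaming described here).
local_filename="${url_filename}@${path#*\?}"   # -> "index.php@topic=16"

# A no-wildcard accept pattern such as "php" acts as a suffix match:
case "$url_filename"   in *php) pre=pass  ;; *) pre=fail  ;; esac
case "$local_filename" in *php) post=pass ;; *) post=fail ;; esac
# The URL name passes the pre-download check, but the local name fails
# the post-download check, so the file would be removed after download.
```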
Re: Accept and Reject - particularly for PHP and CGI sites
If we were going to leave this behavior in for some time, then I think it'd be appropriate to at least mention it (maybe I'll just mention it anyway, without a comprehensive explanation). It would probably be sufficient to add to the docs for 1.11 a very brief mention of the two things that confused the heck out of me: 1) the accept/reject filters are applied twice, once to the URL filename before retrieval and once to the local filename after retrieval, and 2) a query string is not considered to be part of the URL filename.

You can probably imagine my confusion when I saw [EMAIL PROTECTED] being saved. I then tried to prevent that link from being traversed with a match on part of the query string, and I'd see that file disappear, only to realize later that it had been traversed. I had no idea that the query string was not being considered during the acc/rej match, nor that the process was performed twice. I look forward to 1.12.
Re: Accept and Reject - particularly for PHP and CGI sites
Micah Cowan wrote:
> Well, -E is special, true. But in general the second quote is (by definition) correct. -E, obviously, _shouldn't_ be special...

I hope it's clear I'm not complaining. Wget is great, and your efforts are very much appreciated. I just wanted to document the behavior I was seeing in a way that would help others. I actually like the current behavior, now that I (more or less) understand it. I can add php to the accept list, which controls traversal, and also optionally add html if I want to keep the html files. If file retention were determined based solely on the URL, then traversal and local file retention would be inextricably linked.

>> I haven't yet quite figured out file extension matching versus string matching in filenames, but extensions seem to match regardless of leading characters or following ?id=1 parameters.
> That's right; the query portion of the URL is not used to determine matching. There are, of course, times when you specifically wish to tell wget not to follow certain specific query strings (such as edit or print or... in wikis); wget doesn't currently support this (I plan to fix this).

Now I'm confused again. I suppose I can go through more trial and error, or dig through the source to figure out what it's really doing, but in hopes you can throw more light on this, I'll explicate what is confusing me. (Comments relate to wget 1.11 running on Windows XP.)

Confusion 1: Right now, I'm only using file extensions in the accept= parameters, such as accept=zip,jpg,gif,php, etc. Even if the query portion (the ?id=1 part of site.com/index.php?id=1) is not considered during matching, it's not clear to me why accept=php matches site.com/index.php. Why don't I need *.php (Windows) or *php (assuming the * glob matches the period)? Would accept=index match index.php?id=1? How about accept=*index*? I assumed I could do an accept match on the query portion, the filename portion, or even the domain, but I suspect now that's wrong.
The domain gets stripped off when the local name is constructed, so I realize now I can't match on that (the local filename is used for matching), but the query portion is usually left as part of the filename, with an at-sign replacing the question mark. Is filename matching allowed, or only extension matching?

Confusion 2: I'm rejecting based on the query string, usually after an accept string allowing defined extensions. I think I understand this, and I think it's working fine. I'm usually doing something like reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout links or thread-subscription links in a phpbb setting. This works. I think it's doing exactly what you say it's not yet capable of doing, but maybe I'm missing something. Does the accept matching work differently from the reject matching? Does reject work on the URL before retrieval, but accept on the local filename after retrieval? If the site.com/index.php?mode=logout link were being traversed with accept=php and reject=*logout*, I would be getting logged out, but I'm not.

Hm, light perhaps begins to dawn. It looks like both accept and reject are applied twice: once before retrieval and once after. To be retrieved/traversed, a file has to pass both filters, and then after local renaming it has to pass both again. That would fit what I'm seeing. My reject filter prevents traversal of logout links during the first pass, and my accept filter deletes php files during the second check, after html renaming. Thanks for any comments or clarifications.
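Todd's setup can be collected into a hypothetical wgetrc fragment (the patterns are the ones he quotes above; the comments state the behavior as described in this thread, not independently tested fact):

```
# Hypothetical wgetrc mirroring the phpbb-mirroring setup above.
recursive = on
# Suffix-style accept patterns: php must be listed or the script links
# are never traversed; the other types are what we actually want kept.
accept = zip,jpg,gif,php
# Wildcard rejects, matching the query-bearing local-style names, to
# keep wget from following logout and subscription links.
reject = *logout*,*subscribe=*,*watch=*
```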
Re: Accept and Reject - particularly for PHP and CGI sites
Todd Pattist wrote:
>> Well, -E is special, true. But in general the second quote is (by definition) correct. -E, obviously, _shouldn't_ be special...
> I hope it's clear I'm not complaining.

I didn't take it as complaining.

>>> I haven't yet quite figured out file extension matching versus string matching in filenames, but extensions seem to match regardless of leading characters or following ?id=1 parameters.
>> That's right; the query portion of the URL is not used to determine matching. There are, of course, times when you specifically wish to tell wget not to follow certain specific query strings (such as edit or print or... in wikis); wget doesn't currently support this (I plan to fix this).
> Now I'm confused again. I suppose I can go through more trial and error, or dig through the source to figure out what it's really doing, but in hopes you can throw more light on this, I'll explicate what is confusing me. (Comments relate to wget 1.11 running on Windows XP.)
>
> Confusion 1: Right now, I'm only using file extensions in the accept= parameters, such as accept=zip,jpg,gif,php, etc. Even if the query portion (the ?id=1 part of site.com/index.php?id=1) is not considered during matching, it's not clear to me why accept=php matches site.com/index.php. Why don't I need *.php (Windows) or *php (assuming the * glob matches the period)? Would accept=index match index.php?id=1? How about accept=*index*?

(This is in the documentation; at least the full documentation. See the manual on the website; I think the Windows Help files that ship with Wget are based on a short version of the manual.)

The way the matching works is that, if there are any wildcard characters (any of '*', '?', '[' or ']'), then it is a wildcard pattern; otherwise, it's matched exactly against the filename suffix (not necessarily the extension). php will match index.php, or even shophp, but not index.php.foo.
*.php wouldn't match shophp, since the period is right there. This is only ever matched against the filename, and never the domain, directory, or query string (actually, as you've discovered, it's matched against the _local_ filename in some cases, which needs to be fixed).

As I currently understand it from the code, at least for Wget 1.11, matching is against the _URL_'s filename portion (and only that portion: no query strings, no directories) when deciding whether it should download something through a recursive descent (the relevant spot in the code is in recur.c, marked by a comment starting "6. Check for acceptance/rejection rules."). When deciding whether it should delete a file afterwards, however, it uses the _local_ filename (the relevant code is also in recur.c, near "Either --delete-after was specified,"). I'm not positive, but this probably means query strings _do_ matter in that case. :p Confused? Coz I sure am!

> I assumed I could do an accept match on the query portion, the filename portion, or even the domain, but I suspect now that's wrong. The domain gets stripped off when the local name is constructed, so I realize now I can't match on that (the local filename is used for matching), but the query portion is usually left as part of the filename, with an at-sign replacing the question mark. Is filename matching allowed, or only extension matching?

Well, there's a _separate_ option for matching/rejecting domain names (which requires -H to be meaningful, since by default Wget only allows hosts you've explicitly requested, plus any that result from redirections).

> Confusion 2: I'm rejecting based on the query string, usually after an accept string allowing defined extensions. I think I understand this, and I think it's working fine. I'm usually doing something like reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout links or thread-subscription links in a phpbb setting. This works.
> I think it's doing exactly what you say it's not yet capable of doing, but maybe I'm missing something. Does the accept matching work differently from the reject matching?

They use _exactly_ the same code.

> Does reject work on the URL before retrieval, but accept on the local filename after retrieval? If the site.com/index.php?mode=logout link were being traversed with accept=php and reject=*logout*, I would be getting logged out, but I'm not.

What site is it? You might run wget with --debug to find out _exactly_ why it doesn't traverse these (see http://wget.addictivecode.org/FrequentlyAskedQuestions#not-downloading for an enumeration of the various messages Wget uses to say why something isn't downloaded). Some sites are intelligent enough to include a rel=nofollow or nofollow attribute in their anchor tags, which Wget will obey unless -e robots=off was specified. The MoinMoin wiki software, for instance, will do this (which is what the Wget Wgiki runs on).

> Hm, light perhaps begins to dawn. It looks like both
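The matching rules Micah lays out above (a wildcard pattern if any of * ? [ ] appear, otherwise a plain suffix comparison) can be mimicked with shell globs. This is a simulation of the stated rules, not wget's fnmatch-based code:

```shell
# matches PATTERN NAME -> prints yes or no.
matches() {
  pat=$1 name=$2
  case "$pat" in
    *[][*?]*)  # pattern contains a wildcard character: glob match
      case "$name" in $pat)    echo yes ;; *) echo no ;; esac ;;
    *)         # no wildcards: match against the filename suffix
      case "$name" in *"$pat") echo yes ;; *) echo no ;; esac ;;
  esac
}

matches php     index.php       # yes: "php" is a suffix of the name
matches php     shophp          # yes: a suffix, not an extension
matches php     index.php.foo   # no:  the suffix must end the name
matches '*.php' index.php       # yes: ordinary wildcard match
matches '*.php' shophp          # no:  the period is required
```

The examples are exactly the ones from the two messages above (php vs. index.php, shophp, index.php.foo, and *.php).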
Re: Accept and Reject - particularly for PHP and CGI sites
Micah Cowan wrote:
> As I currently understand it from the code, at least for Wget 1.11, matching is against the _URL_'s filename portion (and only that portion: no query strings, no directories) when deciding whether it should download something through a recursive descent (the relevant spot in the code is in recur.c, marked by a comment starting "6. Check for acceptance/rejection rules."). When deciding whether it should delete a file afterwards, however, it uses the _local_ filename (the relevant code is also in recur.c, near "Either --delete-after was specified,"). I'm not positive, but this probably means query strings _do_ matter in that case. :p Confused? Coz I sure am!

I had thought there was already an issue filed against this, but upon searching discovered I was thinking of a couple of related bugs that had been closed. I've filed a new issue for this: https://savannah.gnu.org/bugs/?22670

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Re: Accept and Reject - particularly for PHP and CGI sites
Todd Pattist wrote:
> I'm having trouble understanding how accept and reject work, particularly in the context of sites that rely on CGI and PHP to dynamically generate html pages. My questions relate to the following:
>
> 1) I don't fully understand the -A and -R effects and the difference, if any, between what links are traversed and parsed for deeper links, versus what files are kept and stored locally. The docs seem to say that -A and -R have no effect on the link traversal for html files, but this doesn't seem true for dynamically generated CGI and PHP files.

This "doesn't affect traversal of HTML files" functionality is currently implemented via a heuristic based on the filename extension. That is, if it ends in ".htm" or ".html", I believe, then it will be traversed regardless of -A or -R settings, whereas .cgi or .php will not affect traversal. I'd have to look at the relevant code, but it's possible that "directory"-looking names may also be automatically traversed in that way.

> Does html_extension=on affect link traversal?

No; this only affects whether filenames are changed upon download to explicitly include an ".html" extension (useful for local browsing).

> I'd like to be able to independently control link traversal vs. file retrieval with local file storage. Do the directory include/exclude commands allow this - do they work differently from -A and -R?

I'm afraid I'm unsure what you are asking here.

> 2) The logs seem to show PHP files being retrieved and then not saved. When mirroring a forum, you often want to exclude links that do a logout, or subscribe you to a topic. Does -R prevent a dynamically generated html page from a PHP link from being traversed?

I think I'd need to see an example log of files "being retrieved and then not saved" to understand what you mean.

> 3) Which has priority if both reject and accept filters match?

Not sure; it's easy enough to test this yourself, though.

> 4) Sometimes the OS restricts filename characters.
> Do the -A and -R filters match on the final name used to store the file, or on the name at the server?

They should match the server's name (which includes the Content-Disposition name, if that's being used); however, there were at least some situations where the local name was being matched (there was the case when -nd was being used, at least); I can't recall whether that has been resolved yet; I'm guessing not. Please feel free to report any other cases you encounter where local transformations result in erroneous matches from -A/-R.
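The traversal heuristic described at the start of this reply might be sketched as follows (an illustration of the description only; the real test lives in wget's source and may differ in detail):

```shell
# always_traversed NAME -> yes if the name looks like HTML and is
# therefore traversed regardless of -A/-R; no otherwise.
always_traversed() {
  case "$1" in
    *.htm|*.html) echo yes ;;
    *)            echo no  ;;
  esac
}

always_traversed index.html   # yes: exempt from accept/reject
always_traversed view.php     # no:  accept/reject rules apply
```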
Re: Accept and Reject - particularly for PHP and CGI sites
Thank you for the quick response. Background: I'm on Windows XP, GNU wget 1.11.

> This "doesn't affect traversal of HTML files" functionality is currently implemented via a heuristic based on the filename extension. That is, if it ends in ".htm" or ".html", I believe, then it will be traversed regardless of -A or -R settings, whereas .cgi or .php will not affect traversal.

I'm not sure I understand the ".cgi or .php will not affect traversal." If I use wget to start at http://site.com/view.php?f=16 and recursively mirror without -A or -R, it looks like it traverses deeper, as though that page and other .php links were html files. This makes sense. (I say "looks like" because it takes a long time and produces lots of files.) If I select the same page and add accept=site.com/view.php?id=16 to wgetrc, no pages are saved, it does not traverse any deeper, and it takes only a second or two. I see this in the log:

  Saving to: `site.com/[EMAIL PROTECTED]'
  Removing site.com/[EMAIL PROTECTED] since it should be rejected.

I recognize that the question mark was substituted because of my OS, but that does not matter for the accept filter. What does matter is whether I have the .html or not in the accept filter. That surprises me. Both accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16* will match and keep the site.com/[EMAIL PROTECTED] file, while both accept=site.com/view.php?id=16 and accept=site.com/[EMAIL PROTECTED] cause it not to match and generate the "Removing ... since it should be rejected" line. Regardless of the matching/saving, this seems to control traversal, as I get far deeper traversal with no accept= at all. I'm pretty sure I can control traversal of php links with accept and reject, but I often want to traverse looking for certain file types without saving all the php files traversed.

> I'd have to look at the relevant code, but it's possible that "directory"-looking names may also be automatically traversed in that way.
I don't want you to do work I can do myself. I was just hoping for a link or some pointers that might help.

>> Does html_extension=on affect link traversal?
> No; this only affects whether filenames are changed upon download to explicitly include an ".html" extension (useful for local browsing).

It seems that the html extension is used in the filter matching of accept/reject, and that seems to affect traversal as described above, unless I'm missing something (which is entirely possible).

>> I'd like to be able to independently control link traversal vs. file retrieval with local file storage. Do the directory include/exclude commands allow this - do they work differently from -A and -R?
> I'm afraid I'm unsure what you are asking here.

Is my question clearer from the above? I'm seeing very quick exits (seconds) when the accept filter does not match the start page. To get deeper traversal, I have to match, but then it saves the matched files, and the traversal takes hours, with perhaps thousands of html files (converted from .php files), none of which I need.

>> 2) The logs seem to show PHP files being retrieved and then not saved. When mirroring a forum, you often want to exclude links that do a logout, or subscribe you to a topic. Does -R prevent a dynamically generated html page from a PHP link from being traversed?
> I think I'd need to see an example log of files "being retrieved and then not saved" to understand what you mean.

I put a log of this type above. By adjusting accept and reject, I can exclude traversal of a logout .php link (which I want to do), but I can't seem to traverse links I want to traverse without also saving them locally. It's not critical to resolve this for me, as I can always delete what I don't want, but it is confusing. I wanted to make sure I wasn't missing something.

>> 3) Which has priority if both reject and accept filters match?
> Not sure; it's easy enough to test this yourself, though.
I have done lots of testing, so you'd think this simple one would be obvious. The answer seems to be that reject has the higher priority, since identical accept= and reject= filters seem to produce no output. This matches what the manual says. It might help to add to the manual that adding an accept= filter causes a rejection of everything that does not match the accept filter, even if no reject filter is specified. The fact that specifically accepting some files turns on a default rejection of everything else surprised me, since the normal default is to accept everything. As a matter of interest, HTTrack uses the opposite logic: adding a specific accept in HTTrack has no effect if there is no reject, so the most common format there is to reject everything, followed by a list of file types to accept. The wget procedure is more efficient, since you don't need the starting "reject everything" (and why would you accept if you didn't want to reject something else?), but it
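The precedence measured above (reject wins over accept, and any accept list implicitly rejects everything that fails it) can be written out as a small sketch. This encodes the inference from Todd's testing, not the wget sources; the filenames and patterns are invented:

```shell
# decide NAME ACCEPT_PATTERN REJECT_PATTERN -> accepted or rejected.
decide() {
  name=$1 acc=$2 rej=$3
  case "$name" in $rej) echo rejected; return ;; esac  # reject wins
  case "$name" in $acc) echo accepted ;; *) echo rejected ;; esac
}

decide a.zip '*.zip' '*.zip'      # identical filters: rejected
decide a.txt '*.zip' '*logout*'   # fails the accept list: rejected
decide a.zip '*.zip' '*logout*'   # passes accept, misses reject: accepted
```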
Re: Accept and Reject - particularly for PHP and CGI sites
Todd Pattist wrote:
> Thank you for the quick response. Background: I'm on Windows XP, GNU wget 1.11.
>> This "doesn't affect traversal of HTML files" functionality is currently implemented via a heuristic based on the filename extension. That is, if it ends in ".htm" or ".html", I believe, then it will be traversed regardless of -A or -R settings, whereas .cgi or .php will not affect traversal.
> I'm not sure I understand the ".cgi or .php will not affect traversal."

I mean, it will not detect these as HTML files, so the accept/reject rules will be applied to them without exception.

> If I use wget to start at http://site.com/view.php?f=16 and recursively mirror without -A or -R, it looks like it traverses deeper, as though that page and other .php links were html files. This makes sense. (I say "looks like" because it takes a long time and produces lots of files.) If I select the same page and add accept=site.com/view.php?id=16 to wgetrc, no pages are saved, it does not traverse any deeper, and it takes only a second or two. I see this in the log:
>
>   Saving to: `site.com/[EMAIL PROTECTED]'
>   Removing site.com/[EMAIL PROTECTED] since it should be rejected.
>
> I recognize that the question mark was substituted because of my OS, but that does not matter for the accept filter. What does matter is whether I have the .html or not in the accept filter. That surprises me. Both accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16* will match and keep the site.com/[EMAIL PROTECTED] file, while both accept=site.com/view.php?id=16 and accept=site.com/[EMAIL PROTECTED] cause it not to match and generate the "Removing ... since it should be rejected" line. Regardless of the matching/saving, this seems to control traversal, as I get far deeper traversal with no accept= at all.

After another look at the relevant portions of the source code, it looks like accept/reject rules are _always_ applied against the local filename, contrary to what I'd been thinking.
This needs to be changed. (But it probably won't be, any time soon.)

Note that the view.php?id=16 doesn't mean what you may perhaps think it does: Wget detects the ? as a wildcard, and allows it to match any character (including @). If you supplied \? instead (which matches a literal question mark), I'm guessing it'd actually fail to match, because it's checking against @.

My understanding is that, when you specify a URL directly at the command line, it will be downloaded and traversed (if it turns out to be HTML) no matter what the accept/reject rules are (which can still cause it to be removed afterwards). Therefore, I suspect that what Wget does with your URL when it doesn't match the accept rules is:

1. Downloads the named file.
2. Discovers that, regardless of the filename, it is indeed an HTML file, so scans it for all links to be downloaded.
3. After scanning for all the links, it doesn't find any that end in .html, nor any that match the accept rules, so it doesn't do anything else.

--debug will definitely tell you whether it's bothering to scan that first file or not, and what it decides to do with the links it finds.

> I'm pretty sure I can control traversal of php links with accept and reject, but I often want to traverse looking for certain file types without saving all the php files traversed.
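The '?'-as-wildcard point above can be checked with shell globbing, which shares the '?' and '\?' semantics described here. The local filename is a hypothetical reconstruction (the actual name in Todd's log was redacted), built per the at-sign substitution discussed in this thread:

```shell
# Hypothetical local name, with '@' substituted for '?' on disk.
name='view.php@id=16'

# Unescaped '?' is a wildcard matching any single character,
# so the pattern matches the '@' in the local name.
case "$name" in view.php?id=16)  unescaped=match ;; *) unescaped=nomatch ;; esac

# Escaped '\?' matches only a literal question mark, so it fails here.
case "$name" in view.php\?id=16) escaped=match ;; *) escaped=nomatch ;; esac
```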
We're looking for more fine-grained controls to allow this sort of thing, but at the moment, my understanding is that there is no control over whether Wget traverses-and-then-deletes a given file: it will _always_ do that for files it knows or suspects are HTML (based on .htm or .html suffixes, or if, as in the above example, it will download the file first anyway because it's an explicit command-line argument); it will _never_ download/traverse any other sorts of links that do not match the accept rules. If something _does_ match the accept rules, and turns out after download to be an HTML file (determined by the server's headers), Wget will traverse it further; but of course it won't delete such files afterward, because they matched the accept list.

>> I'd have to look at the relevant code, but it's possible that "directory"-looking names may also be automatically traversed in that way.
> I don't want you to do work I can do myself. I was just hoping for a link or some pointers that might help.

It looks like this idea was incorrect anyway; it's only based on the suffix.

>>> Does html_extension=on affect link traversal?
>> No; this only affects whether filenames are changed upon download to explicitly include an ".html" extension (useful for local browsing).
> It seems that the html extension is used in the filter matching of accept/reject, and that seems to affect traversal as described above, unless I'm missing something (which is entirely possible).

Yes, it does; my bad.

> I'd like to be able to independently control link traversal vs. file retrieval
Re: Accept and Reject - particularly for PHP and CGI sites
This cleared up a lot. I really appreciate your reply. I've been using the log and the server_response=on parameters, but not --debug. I'll add that now and take a look, but your 1..2..3 answer below and the comment that accept/reject matching is done on the local filename explain what I'm seeing. From your comments, I'm confident I can get it to do what I want, the only problem being that I'll have to delete excess files. That's not really a problem for me, as long as I understand what it is doing and why.

Micah Cowan wrote:
> After another look at the relevant portions of the source code, it looks like accept/reject rules are _always_ applied against the local filename, contrary to what I'd been thinking. This needs to be changed. (But it probably won't be, any time soon.)
> [...]