Micah Cowan wrote:
After another look at the relevant portions of the source code, it looks
like accept/reject rules are _always_ applied against the local
filename, contrary to what I'd been thinking. This needs to be changed.
(But it probably won't be, any time soon.)
.... and
If something _does_ match the accept rules, and turns out after download
to be an HTML file (determined by the server's headers), it will
traverse it further; but of course it won't delete them afterward
because they matched the accept list.

I'd like to help clarify, for others who may read this, how wget 1.11 handles .php, .cgi and similar files (on Windows, though I expect the behavior is the same on other OSs). It has taken me a while to grok this even partially. The first quote above is correct. The second is not, at least not when you use html_extension = on as I do, and the reason is precisely that the first quote is correct: the accept rules are applied to the local filename, which by then ends in .html.

Suppose you want all the jpg and zip files on a site. If it is a conventional site with .html links on .html pages, all you need to do is add accept = jpg,zip to wgetrc and mirror the site recursively. The accept command does not affect the traversal of html links: wget downloads the first page (index.html), parses it for links, saves the jpg and zip files it finds there, and follows the html links on that page, saving jpgs and zips as it goes, until it reaches the default depth of 5 levels. It is important to know that accept = is really a powerful rejection command: it rejects everything that does not match. The one exception is html links, which are traversed regardless of the accept list but not kept.

Now compare that with the same site where the links are all php links and the same accept = jpg,zip command is used. If the starting point is a php file (with one or more parameters added, like http://site.com/index.php?id=1), the starting index.php file is requested, and the server's headers identify it to wget as an html file. The received page is parsed for html links (there will probably be none, since they will all be php links such as index.php?id=2) and for jpgs or zips (there will be none of these either, since they will probably look like file.php?id=32). The starting index.php file is saved locally with an .html extension. Since html is not on the accept list, this file is not kept. The result is no files and no traversal beyond the first page, even though plenty of zip and jpg files are available behind deeper php links.

Now consider what happens if the accept list is expanded to include php files: accept = php,jpg,zip (each entry matches every file with that extension). The starting index.php file is brought in, saved locally with an .html extension, and parsed. Numerous php links are found during that parsing. Wget does not yet know whether they are html files, jpgs or zips, but it knows they are php files that match the accept list. The starting file is thrown away after parsing, because locally it is now an html file, not a php file, and html files are NOT on the accept list. The php links found on the starting page, however, have made it onto the download list (because they match the accept list) and will be traversed if they turn out to be html files when they are received. They will not be kept if they are html, since html is not on the accept list and the deletion of non-matching files is based on the local name and local extension (which is now html, not php). If they are files that end in jpg or zip, they will be kept (set content_disposition = on so that the local file name can come from the server's Content-Disposition header). Reject can be used to override the accept parameters if needed (reject = *logout* would be typical for forums, to avoid being logged out of authenticated sites). For sites built on other server-side scripts, replace php with cgi, or just add php, cgi, asp, etc. to the accept list.
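For anyone who finds code easier to follow than prose, here is a rough Python model of the behavior described above. It is a sketch of my description only, not wget's actual source: the function names (url_file, accepted, fate) are made up for illustration, the query string is simply dropped from the local name, and html_extension = on is assumed throughout.

```python
from urllib.parse import urlsplit
import posixpath

HTML_SUFFIXES = (".html", ".htm")

def url_file(url):
    """File part of the URL path, without any ?query parameters."""
    return posixpath.basename(urlsplit(url).path)

def accepted(name, accept):
    """Suffix match against the accept list; an empty list accepts everything."""
    return not accept or any(name.endswith(suffix) for suffix in accept)

def fate(url, is_html, accept):
    """Return (downloaded, traversed, kept) for a link found on a page.

    is_html stands for what the server's headers would say about the response.
    """
    name = url_file(url)
    # Stage 1 (before download): html-looking links are always fetched so
    # they can be traversed; anything else must match the accept list.
    looks_html = name == "" or name.endswith(HTML_SUFFIXES)
    downloaded = looks_html or accepted(name, accept)
    traversed = downloaded and is_html
    # Stage 2 (after download): with html_extension = on, an html response
    # is saved with .html appended, and the accept list is checked against
    # that local name, not against the name in the URL.
    local = name
    if is_html and not looks_html:
        local += ".html"
    kept = downloaded and accepted(local, accept)
    return downloaded, traversed, kept

# accept = jpg,zip: the php link never reaches the download list
print(fate("http://site.com/index.php?id=2", True, ["jpg", "zip"]))
# accept = php,jpg,zip: fetched and traversed, then deleted as index.php.html
print(fate("http://site.com/index.php?id=2", True, ["php", "jpg", "zip"]))
```

The point of the two stages is that the same accept list is consulted twice against two different names, which is why adding php lets the crawl proceed without leaving any php pages behind.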

I haven't quite figured out file extension matching versus string matching in filenames yet, but extensions seem to match regardless of leading characters or trailing parameters such as ?id=1. If anyone wants to explicate this, I'd appreciate it. I hope this helps someone.
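As far as I can tell from the wget manual, an accept or reject entry containing any of the wildcard characters *, ?, [ or ] is treated as a pattern, and any other entry is treated as a plain suffix. A minimal Python sketch of that rule follows. The helper name matches is hypothetical, and Python's fnmatchcase is only standing in for whatever matcher wget really uses.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")

def matches(name, entry):
    """Match a file name against one accept/reject list entry."""
    if WILDCARDS & set(entry):
        # Entry contains a wildcard: treat it as a shell-style pattern.
        return fnmatchcase(name, entry)
    # No wildcard: treat the entry as a suffix of the file name.
    return name.endswith(entry)

print(matches("index.php", "php"))         # suffix entry
print(matches("logout.php", "*logout*"))   # pattern entry
```

Under this rule a suffix entry such as php needs no leading dot or wildcard, which would fit the observation that extensions match regardless of the leading characters.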

