Re: Accept and Reject - particularly for PHP and CGI sites

Todd Pattist Wed, 19 Mar 2008 07:41:00 -0700

Micah Cowan wrote:

After another look at the relevant portions of the source code, it looks
like accept/reject rules are _always_ applied against the local
filename, contrary to what I'd been thinking. This needs to be changed.
(But it probably won't be, any time soon.)

.... and

If something _does_ match the accept rules, and turns out after download
to be an HTML file (determined by the server's headers), it will
traverse it further; but of course it won't delete them afterward
because they matched the accept list.

I'd like to help clarify for others who may read this how wget 1.11 is working for .php, .cgi and similar files (on Windows, but I expect the behavior is the same on other OSs). It has taken me a while to grok this even partially. The first quote above is correct. The second quote is not, at least not when you use html_extension = on as I do. The reason it's not correct is because the first quote is correct.

Suppose you want all the jpg and zip files on a site. If the site is a normal site with .html links on .html pages, all you need to do is add accept = jpg,zip to wgetrc and recursively mirror the site. This accept command will not affect the traversing of html links, so it will download the first page (index.html), parse it for html links, save all the jpg and zip files it finds on the first page and traverse all the identified html links on that page until it has reached the default of 5 levels deep, saving the jpgs and zips as it goes. It is important to know that the accept = command is a powerful rejection command, as it will reject everything that does not match, except html links, which will be traversed regardless of the accept list, but not kept.

Now compare that behavior with the behavior of the same site where the links are all php links and the same accept = jpg,zip command is used. If the starting point is a php file (with one or more parameters added on, like http://site.com/index.php?id=1), the starting point index.php file will be requested. The server header will identify it to wget as an html file. The received page will be parsed for any html links (there will probably be none, since they will all be other php file links such as index.php?id=2) and parsed for any jpgs or zips (there will be none of these either, since they will probably look like file.php?id=32 links). The index.php starting file will be locally renamed to .html. Since html is not on the accept list, this file will not be kept. The result is no files and no traversing of deeper links beyond the first page, despite the fact that there are lots of zip files and jpg files available at deeper php link levels.
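To make the failure case above concrete, here is a minimal Python sketch of the link-following rule as this post describes it. It is a toy model, not wget's code: the function names, the two extra URLs (about.html, photo.jpg) and the choice to drop the ?id=... part before matching are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

HTML_SUFFIXES = (".html", ".htm")

def matches_accept(name, accept):
    """True if name ends with any entry of the accept list (suffix match)."""
    return any(name.endswith(entry) for entry in accept)

def should_queue(url, accept):
    """Toy rule: follow a link if it looks like HTML or matches accept."""
    # Assumption: only the file part of the URL is matched, query dropped.
    name = urlsplit(url).path.rsplit("/", 1)[-1]
    if name == "" or name.lower().endswith(HTML_SUFFIXES):
        return True  # html links are traversed regardless of accept
    return matches_accept(name, accept)

links = [
    "http://site.com/index.php?id=2",
    "http://site.com/file.php?id=32",
    "http://site.com/about.html",   # hypothetical
    "http://site.com/photo.jpg",    # hypothetical
]

for accept in (["jpg", "zip"], ["php", "jpg", "zip"]):
    queued = [u for u in links if should_queue(u, accept)]
    print(accept, "->", queued)
```

With accept = jpg,zip the two php links never reach the download list in this model, which is the "no traversing beyond the first page" result described above.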

Now consider what happens if the accept list is expanded to add php files with accept = php,jpg,zip (these match all files with matching extensions). The starting index.php file is brought in, locally renamed to html and parsed. Numerous php files are identified during that parsing. Wget does not yet know if they are html files or jpgs or zips, but it knows they are php files that match the accept list. The starting file is thrown away after parsing, because it is now locally an html file, not a php file, and html files are NOT on the accept list. The php files identified on the starting page, however, have made it to the download list (because they are php files that match the accept list) and will be traversed if they are identified as html files as they are received. They will not be kept, however, if they are html files, as html files are not on the accept list, and the deletion of non-matching accept list files is made based on the local name and local file extension (which is now html, not php). If they are files that end in jpg or zip, they will be kept (turn content_disposition = on, so that the local name can come from the server's Content-Disposition header).

Reject can be used to override the accept parameters, if needed (reject = *logout* for forums would be typical, to prevent logging out of authenticated sites). Replace php with cgi, or just add php, cgi, asp, etc. to the accept list for other server-side script based sites.
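The keep-or-delete step described above can be sketched the same way. Again this is only a model of the behavior reported in this post, assuming html_extension = on; the helper names and the pictures.zip file name are hypothetical, and how the ?id=... part ends up in the local name on a given OS is not modeled beyond keeping it as-is.

```python
def local_name(url_name, is_html, disposition_name=None):
    """Name the file gets on disk in this toy model (html_extension = on)."""
    # Assumption: a Content-Disposition name, when given, replaces the URL name.
    name = disposition_name or url_name
    if is_html and not name.endswith((".html", ".htm")):
        name += ".html"
    return name

def kept(url_name, is_html, accept, disposition_name=None):
    """The file survives only if its LOCAL name matches the accept list."""
    name = local_name(url_name, is_html, disposition_name)
    return any(name.endswith(entry) for entry in accept)

accept = ["php", "jpg", "zip"]

# The starting page is HTML, so it is saved with .html appended and
# no longer matches "php": parsed, then deleted.
print(kept("index.php?id=1", True, accept))

# A php link that delivers a zip, named by the server (hypothetical name).
print(kept("file.php?id=32", False, accept, disposition_name="pictures.zip"))
```

The point of the model is that the same php entry that gets a link onto the download list does nothing to keep the page afterward, because by then the name being tested ends in html.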

I haven't yet quite figured out file extension matching versus string matching in filenames, but extensions seem to match regardless of leading characters or following ?id=1 parameters. If anyone wants to explicate this, I'd appreciate it. I hope this helps someone.
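On the suffix-versus-pattern question, the wget manual describes accept/reject entries as either file name suffixes or shell-style patterns, with an entry treated as a pattern when it contains a wildcard character. A rough Python sketch of that rule follows; it is my reading of the manual, not a port of wget's code, and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase

WILDCARDS = set("*?[]")

def entry_matches(name, entry):
    """Pattern match if the entry has wildcards, otherwise suffix match."""
    if WILDCARDS & set(entry):
        return fnmatchcase(name, entry)
    return name.endswith(entry)

def acceptable(name, accept=(), reject=()):
    """A name passes if it matches accept (when set) and no reject entry."""
    if accept and not any(entry_matches(name, e) for e in accept):
        return False
    if any(entry_matches(name, e) for e in reject):
        return False
    return True

# "php" has no wildcard, so it is a plain suffix: leading characters
# do not matter, and there is no requirement for a dot before it.
print(entry_matches("viewtopic.php", "php"))
print(entry_matches("myphp", "php"))

# "*logout*" has wildcards, so it matches anywhere in the name.
print(acceptable("viewtopic.php", accept=["php"], reject=["*logout*"]))
print(acceptable("logout.php", accept=["php"], reject=["*logout*"]))
```

Under this reading, "regardless of leading characters" is just what a suffix match does; whether the ?id=1 part is stripped before matching is a separate question that this sketch does not settle.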
