This cleared up a lot. I really appreciate your reply. I've been using the log and the server_response = on parameters, but not --debug. I'll add that now and take a look, but your 1..2..3.. answer below and your comment that accept/reject matching is done on the local filename explain what I'm seeing. From your comments, I'm confident I can get it to do what I want; the only remaining problem is that I'll have to delete excess files. That's not really a problem for me, as long as I understand what it is doing and why.

Micah Cowan wrote:

Todd Pattist wrote:
Thank you for the quick response.  Background: I'm on Windows XP, GNU
Wget 1.11.
This "doesn't affect traversal of HTML files" functionality is currently
implemented via a heuristic based on the filename extension. That is, if
it ends in ".htm" or ".html", I believe, then it will be traversed
regardless of -A or -R settings, whereas .cgi or .php will not affect
traversal.
I'm not sure I understand the "cgi or .php will not affect traversal."

I mean, it will not detect these as HTML files, so the accept/reject
rules will be applied to them without exception.
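The heuristic described above can be sketched as a shell function (this is my reading of the description, not the actual Wget source; the suffix list is an assumption):

```shell
# Sketch of the suffix heuristic: names ending in .htm/.html are
# traversed regardless of -A/-R, while anything else (.php, .cgi, ...)
# is subject to the accept/reject rules without exception.
is_traversed_regardless() {
  case "$1" in
    *.htm|*.html) return 0 ;;  # treated as HTML; always traversed
    *)            return 1 ;;  # accept/reject rules apply
  esac
}

is_traversed_regardless "index.html" && echo "traversed"
is_traversed_regardless "view.php"   || echo "rules apply"
```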

If I use wget to start at http://site.com/view.php?f=16 and recursively
mirror without -A or -R, it looks like it traverses deeper, as though
that page and other .php links are HTML files. This makes sense. (I say
"looks like" because it takes a long time and produces lots of files.)

If I select the same page and add accept=site.com/view.php?id=16 to
wgetrc, no pages are saved, it does not traverse any deeper, and it
takes only a second or two.  I see this in the log:

Saving to: `site.com/[EMAIL PROTECTED]'
Removing site.com/[EMAIL PROTECTED] since it should be rejected.

I recognize that the question mark was substituted because of my OS, but
that does not matter for the accept filter.  What does matter is whether
I have the .html in the accept filter or not.  That surprises me.  Both
accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16*
will match and keep the
site.com/[EMAIL PROTECTED] file, while both
accept=site.com/view.php?id=16 and accept=site.com/[EMAIL PROTECTED] fail
to match and generate the "Removing ... since it should be
rejected" line.  Regardless of the matching/saving, this seems to control
traversal, as I get far deeper traversal with no accept= at all.

After another look at the relevant portions of the source code, it looks
like accept/reject rules are _always_ applied against the local
filename, contrary to what I'd been thinking. This needs to be changed.
(But it probably won't be, any time soon.)
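Given that matching is done against the local filename, a wgetrc along these lines should match reliably (a sketch; view.php and id=16 are the placeholders from the example above, and the wildcards are there to absorb both the character the OS substitutes for "?" and the appended extension):

```
# wgetrc sketch: the pattern must match the LOCAL filename, not the URL.
# The '*' after 'php' covers the character Windows substitutes for '?',
# and the trailing '*' covers the '.html' that html_extension = on appends.
html_extension = on
accept = view.php*id=16*
```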

Note that the view.php?id=16 doesn't mean what you may perhaps think it
does: Wget detects the "?" as a wildcard, and allows it to match any
character (including "@"). If you supplied "\?" instead (which matches a
literal question mark), I'm guessing it'd actually fail to match,
because it's checking against "@".
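Since Wget's accept/reject patterns use shell-style globbing, the point about "?" can be illustrated with a shell case statement (the filename below is a stand-in for the @-substituted local name from the example):

```shell
# The local filename: the OS replaced the '?' with another character
# (here assumed to be '@', per the discussion above).
f='site.com/view.php@id=16.html'

# An unescaped '?' is a wildcard matching any single character,
# so it happily matches the '@'.
case "$f" in
  site.com/view.php?id=16*) echo "matches: ? matched the @" ;;
  *)                        echo "no match" ;;
esac

# An escaped '\?' matches only a literal question mark,
# which is no longer present in the local name.
case "$f" in
  site.com/view.php\?id=16*) echo "matches" ;;
  *)                         echo "no match: the literal ? is gone" ;;
esac
```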

My understanding is that, when you specify a URL directly at the
command-line, it will be downloaded and traversed (if it turns out to be
HTML), no matter what the accept/reject rules are (which can still cause
it to be removed afterwards). Therefore, I suspect that what Wget does
with your URL when it isn't matching the accept rules is:

  1. Downloads the named file
  2. Discovers that, regardless of the filename, it is indeed an HTML
file, so scans it for all links to be downloaded.
  3. After scanning for all the links, it doesn't find any that end in
".html", nor any that match the accept rules, so it doesn't do anything
else.

--debug will definitely tell you whether it's bothering to scan that
first file or not, and what it decides to do with the links it finds.

I'm pretty sure I can control traversal of php links with accept and
reject, but I often want to traverse looking for certain file types
without saving all the php files traversed along the way.

We're looking for more fine-grained controls to allow this sort of
thing, but at the moment, my understanding is that there is no control
over whether Wget traverses-and-then-deletes a given file: it will
_always_ do that for files it knows or suspects are HTML (based on .htm,
.html suffixes, or if, like the above example, it will download the
filename first anyway because it's an explicit command-line argument);
it will _never_ download/traverse any other sorts of links that do not
match the accept rules.

If something _does_ match the accept rules, and turns out after download
to be an HTML file (determined by the server's headers), Wget will
traverse it further; but of course it won't delete such files afterward,
because they matched the accept list.

I'd have to look at the relevant code, but it's possible that
"directory"-looking names may also be automatically traversed in that way.
I don't want you to do work I can do myself.  I was just hoping for a
link or some pointers that might help.

It looks like this idea was incorrect anyway; it's only based on the suffix.

Does html_extension=on affect link traversal?
No; this only affects whether filenames are changed upon download to
explicitly include an ".html" extension (useful for local browsing).
It seems that the html extension is used in the filter matching of
accept/reject, and that seems to affect traversal as described above
unless I'm missing something (which is entirely possible).

Yes, it does; my bad.

I'd like to be able to
independently control link traversal vs. file retrieval with local file
storage.  Do the directory include/exclude commands allow this - do they
work differently from -A -R?
I'm afraid I'm unsure what you are asking here.
Is my question clearer from the above?  I'm seeing very quick exits
(seconds) when the accept filter does not match the start page.  To get
deeper traversal, I have to match, but then it saves the matched files
and the traversal takes hours, producing perhaps thousands of html files
(converted from .php files), none of which I need.

Yes, the question is clearer, and unfortunately the answer is "not
currently". :\


3) Which has priority if both reject and accept filters match?
Not sure; it's easy enough to test this yourself, though.
I have done lots of testing, so you'd think this simple one would be
obvious.  The answer seems to be that reject has higher priority, since
identical accept= and reject= values seem to produce no output.  This
matches what the manual says.
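That precedence can be confirmed with a deliberately conflicting wgetrc (a sketch; the pattern itself is arbitrary):

```
# wgetrc sketch: the same pattern in both lists.
# Since reject takes priority over accept, every download matching
# *.html is removed after retrieval -- i.e. nothing is kept.
accept = *.html
reject = *.html
```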

--debug is your friend. It will tell you explicitly what it thinks about
the links it finds.

It might help to add to the manual that adding an
accept= filter causes a rejection of everything that does not match the
accept filter, even if there is no reject filter specified.  The fact
that specifically accepting some files turns on a default rejection of
everything else surprised me, since the normal default is to accept
everything.

It actually is in the manual, but probably not the Windows Help
documentation that you've got. My recollection is that the latter is
generated from the abbreviated reference which, on Unix, becomes the
"manpage".

The full manual is available, in various formats, at
http://www.gnu.org/software/wget/manual/.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
