Thank you for the quick response.  For background: I'm on Windows XP with GNU Wget 1.11.
This "doesn't affect traversal of HTML files" functionality is currently
implemented via a heuristic based on the filename extension. That is, if
it ends in ".htm" or ".html", I believe, then it will be traversed
regardless of -A or -R settings, whereas .cgi or .php will not affect
traversal.
  
I'm not sure I understand the ".cgi or .php will not affect traversal."  If I use wget to start at http://site.com/view.php?f=16 and mirror recursively without -A or -R, it appears to traverse deeper, as though that page and the other .php links were HTML files.  This makes sense.  (I say "appears to" because it takes a long time and produces lots of files.)  If I start at the same page and add accept=site.com/view.php?id=16 to wgetrc, no pages are saved, it does not traverse any deeper, and it finishes in only a second or two.  I see this in the log:

Saving to: `site.com/[EMAIL PROTECTED]'
Removing site.com/[EMAIL PROTECTED] since it should be rejected.

I recognize that the question mark was substituted for my OS, but that doesn't matter to the accept filter.  What does matter is whether the accept filter includes the .html or not.  That surprises me.  Both accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16* match and keep the
site.com/[EMAIL PROTECTED] file, while both accept=site.com/view.php?id=16 and accept=site.com/[EMAIL PROTECTED] fail to match and generate the "Removing ... since it should be rejected" line.  Beyond the matching/saving, this also seems to control traversal, since I get far deeper traversal with no accept= at all.
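For what it's worth, the behaviour I'm seeing can be mimicked with shell glob patterns, which (as I understand it) work like the fnmatch-style test wget applies when an -A/-R pattern contains wildcard characters.  This is only a sketch, not wget's actual code, and the local filename view.php@id=16.html is a hypothetical reconstruction (the '?' replaced by '@' for the OS, and ".html" appended by html_extension=on):

```shell
#!/bin/sh
# Sketch of the accept-filter matching described above, using shell glob
# patterns as a stand-in for wget's fnmatch-style -A test.
match() {
  # $1 = accept pattern, $2 = local filename
  case "$2" in
    $1) echo "keep" ;;
    *)  echo "rejected" ;;
  esac
}
f="view.php@id=16.html"   # hypothetical local name
match "view.php?id=16.html" "$f"   # keep ('?' is a wildcard, so it matches '@')
match "view.php?id=16*"     "$f"   # keep (trailing '*' covers the .html)
match "view.php?id=16"      "$f"   # rejected (trailing .html not matched)
```

Under that reading, the pattern without .html fails simply because nothing in it covers the extension the local file was given.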

I'm pretty sure I can control traversal of .php links with accept and reject, but I often want to traverse looking for certain file types without saving all the .php files traversed along the way.

I'd have to look at the relevant code, but it's possible that
"directory"-looking names may also be automatically traversed in that way.
  
I don't want you to do work I can do myself.  I was just hoping for a link or some pointers that might help.

Does html_extension=on affect link traversal?
    

No; this only affects whether filenames are changed upon download to
explicitly include an ".html" extension (useful for local browsing).
  

It seems that the .html extension is used in the accept/reject filter matching, and that seems to affect traversal as described above, unless I'm missing something (which is entirely possible).
I'd like to be able to
independently control link traversal vs. file retrieval with local file
storage.  Do the directory include/exclude commands allow this - do they
work differently from -A -R?
    

I'm afraid I'm unsure what you are asking here.
  
Is my question clearer from the above?  I'm seeing very quick exits (seconds) when the accept filter does not match the start page.  To get deeper traversal, I have to match, but then it saves the matched files, and the traverse takes hours, producing perhaps thousands of HTML files (converted from .php), none of which I need.

2) The logs seem to show PHP files being retrieved and then not saved.
When mirroring a forum, you often want to exclude links that do a
logout, or subscribe you to a topic.  Does -R prevent a dynamically
generated html page from a PHP link from being traversed?
    

I think I'd need to see an example log of files "being retrieved and
then not saved", to understand what you mean.
  
I put a log of this type above.  By adjusting accept and reject, I can exclude traversing a logout .php link (which I want to do), but I can't seem to traverse links I want to traverse without also saving them locally.  It's not critical to resolve this for me, as I can always delete what I don't want, but it is confusing.  I wanted to make sure I wasn't missing something.
3) Which has priority if both reject and accept filters match?
    
Not sure; it's easy enough to test this yourself, though.
  
I have done lots of testing, so you'd think this simple one would be obvious.  The answer seems to be that reject has higher priority, since identical accept= and reject= lists produce no output.  This matches what the manual says.  It might help if the manual also noted that adding an accept= filter causes rejection of everything that does not match it, even when no reject filter is specified.  The fact that specifically accepting some files turns on a default rejection of everything else surprised me, since the normal default is to accept everything.
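Put another way, my reading of the precedence (from testing and the manual, not from the wget source) can be sketched like this:

```shell
#!/bin/sh
set -f  # disable pathname expansion so patterns stay literal
# Sketch of the accept/reject precedence described above: a reject match
# always wins, and a non-empty accept list implicitly rejects anything
# it does not match.  This is my reading, not wget's actual code.
decide() {
  name=$1 accepts=$2 rejects=$3
  for r in $rejects; do
    case "$name" in $r) echo "reject"; return ;; esac
  done
  if [ -n "$accepts" ]; then
    for a in $accepts; do
      case "$name" in $a) echo "accept"; return ;; esac
    done
    echo "reject"   # accept list present but nothing matched
    return
  fi
  echo "accept"     # no filters at all: the normal default
}
decide "page.html"  "*.html" "*.html"  # reject: reject outranks accept
decide "logout.php" "*.html" ""        # reject: implicit "reject the rest"
decide "logout.php" ""       ""        # accept: default with no filters
```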

As a matter of interest, HTTrack uses the opposite logic: adding a specific accept has no effect if there is no reject, so the most common pattern is to reject everything and then list the file types to accept.  The wget approach is more efficient, since you don't need the initial "reject everything" (and why would you accept something if you didn't want to reject everything else?), but it would help a beginner like me if the manual made that behavior clear.
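To make the contrast concrete, here is roughly how the same "mirror only HTML" intent looks in each tool.  These command lines are illustrative config fragments only (the hostname is hypothetical, and I haven't run either exactly as shown):

```shell
# HTTrack: accept filters do nothing on their own, so the idiom is
# "reject everything, then re-accept what you want":
httrack "http://site.com/" -O mirror "-*" "+*.html"

# wget: a non-empty accept list already implies "reject the rest":
wget -r -A html,htm http://site.com/
```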

4) Sometimes the OS restricts filename characters.  Do the -A and -R
filters match on the final name used to store the file, or on the name
at the server?
    
They should match the server's name (which includes the
Content-Disposition name, if that's being used); however, there were at
least some situations where the local name was being matched (there was
the case when -nd was being used, at least); I can't recall whether that
was resolved yet, I'm guessing not.
  
I do see the .html being required for a match when html_extension=on, as above.  I don't see the ?/@ substitution making any difference.  With Content-Disposition turned on, the final name is used for matching, but I've noticed that the file ends up in the root, not the correct directory.

BTW, thanks for a great program.  This exchange has helped.

