Thank you for the quick response. Background is I'm on Windows XP, GNU wget 1.11.

> This "doesn't affect traversal of HTML files" functionality is currently implemented via a heuristic based on the filename extension. That is, if it ends in ".htm" or ".html", I believe, then it will be traversed regardless of -A or -R settings, whereas .cgi or .php will not affect traversal.

I'm not sure I understand the "cgi or .php will not affect traversal." If I use wget to start at http://site.com/view.php?f=16 and recursively mirror without -A or -R, it looks like it traverses deeper as though that page and other .php links are html files. This makes sense. (I say looks like, because it takes a long time and produces lots of files). If I select the same page and add accept=site.com/view.php?id=16 to wgetrc, no pages are saved and it does not traverse any deeper and it takes only a second or two. I see this in the log:

Saving to: `site.com/[EMAIL PROTECTED]'
Removing site.com/[EMAIL PROTECTED] since it should be rejected.

I recognize that the question mark was substituted for my OS, but that does not matter on the accept filter. What does matter is whether I have the .html or not in the accept filter. That surprises me. Both accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16* will match and keep the site.com/[EMAIL PROTECTED] file, while both accept=site.com/view.php?id=16 and accept=site.com/[EMAIL PROTECTED] cause it not to match and generate the "Removing ... since it should be rejected" line.

Regardless of the matching/saving this seems to control traversal, as I get far deeper traversal with no accept= at all. I'm pretty sure I can control traversal of php links with accept and reject, but I often want to traverse looking for certain file types, but don't want to save all the php files traversed.

> I'd have to look at the relevant code, but it's possible that "directory"-looking names may also be automatically traversed in that way.

I don't want you to do work I can do myself. I was just hoping for a link or some pointers that might help.

>> Does html_extension=on affect link traversal?
>
> No; this only affects whether filenames are changed upon download to explicitly include an ".html" extension (useful for local browsing).

It seems that the html extension is used in the filter matching of accept/reject, and that seems to affect traversal as described above unless I'm missing something (which is entirely possible).

>> I'd like to be able to independently control link traversal vs. file retrieval with local file storage. Do the directory include/exclude commands allow this - do they work differently from -A -R?
>
> I'm afraid I'm unsure what you are asking here.

Is my question clearer from the above? I'm seeing very quick exits (seconds) when the accept filter does not match the start page. To get deeper traversing, I have to match, but then it saves the matched files and the traverse takes hours, with perhaps thousands of html files (converted from .php files), none of which I need.

>> 2) The logs seem to show PHP files being retrieved and then not saved. When mirroring a forum, you often want to exclude links that do a logout, or subscribe you to a topic. Does -R prevent a dynamically generated html page from a PHP link from being traversed?
>
> I think I'd need to see an example log of files "being retrieved and then not saved", to understand what you mean.

I put a log of this type above. By adjusting accept and reject, I can exclude traversing a logout .php link (which I want to do), but I can't seem to traverse links I want to traverse without also saving them locally. It's not critical to resolve this for me, as I can always delete what I don't want, but it is confusing. I wanted to make sure I wasn't missing something.

>> 3) Which has priority if both reject and accept filters match?
>
> Not sure; it's easy enough to test this yourself, though.

I have done lots of testing, so you'd think this simple one would be obvious. The answer seems to be that reject is higher priority, since identical accept= and reject= seem to produce no output. This matches what the manual says. It might help to add to the manual that adding an accept= filter causes a rejection of everything that does not match the accept filter, even if there is no reject filter specified. The fact that specifically accepting some files turns on a default rejection of everything else surprised me, since the normal default is to accept everything.

As a matter of interest, httrack uses the opposite logic. Adding a specific accept in httrack has no effect if there is no reject. Thus, the most common format is to reject everything followed by a list of filetypes to accept. The wget procedure is more efficient since you don't need the starting "reject everything," and why would you accept if you didn't want to reject something else, but it would help the beginner, like me, to make that behavior clear in the manual.

>> 4) Sometimes the OS restricts filename characters. Do the -A and -R filters match on the final name used to store the file, or on the name at the server?
>
> They should match the server's name (which includes the Content-Disposition name, if that's being used); however, there were at least some situations where the local name was being matched (there was the case when -nd was being used, at least); I can't recall whether that was resolved yet, I'm guessing not.

I do see the .html being required for a match when html_extension=on, as above. I don't see the ?/@ as making any difference. With Content-Disposition turned on, the final name is used for matching, but I've noticed that the file ends up in the root, not the correct directory.

BTW, thanks for a great program. This exchange has helped.
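
The accept/reject behavior reported in this message can be modeled in a few lines of Python. This is only a sketch of the observations above, not wget's actual code: shell-style matching via Python's fnmatch is a stand-in for whatever wget does internally, the helper names matches_any and is_kept are made up for illustration, and the saved filename is an assumption (the archive obscured the real one; here it is taken to be the URL with "?" replaced by "@" and ".html" appended).

```python
from fnmatch import fnmatchcase


def matches_any(name, patterns):
    """Return True if name matches at least one shell-style pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def is_kept(name, accept=(), reject=()):
    """Model of the filter behavior observed in this thread (hypothetical helper).

    - With no accept list, everything is accepted.
    - A non-empty accept list implicitly rejects whatever it does not match.
    - A reject match wins over an accept match.
    """
    if accept and not matches_any(name, accept):
        return False
    return not matches_any(name, reject)


# Assumed local name: '?' replaced by '@', '.html' appended by html_extension=on.
saved = "site.com/view.php@id=16.html"

for accept in (
    [],                                  # no filter at all
    ["site.com/view.php?id=16"],         # lacks the .html suffix
    ["site.com/view.php?id=16.html"],    # '?' acts as a one-character wildcard
    ["site.com/view.php?id=16*"],        # trailing '*' covers the suffix
):
    print(accept, "->", "kept" if is_kept(saved, accept=accept) else "removed")

# Identical accept and reject patterns: the reject match wins.
print(is_kept(saved, accept=["*id=16*"], reject=["*id=16*"]))
```

Under these assumptions the model reproduces the pattern described: the filter without ".html" removes the file, the two filters that cover the suffix keep it, and an identical accept/reject pair leaves nothing.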
- Accept and Reject - particularly for PHP and CGI sites Todd Pattist
- Re: Accept and Reject - particularly for PHP and CGI sit... Micah Cowan