Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-20 Thread Todd Pattist





   When deciding whether it should delete a file afterwards, however, it
 uses the _local_ filename (relevant code also in recur.c, near "Either
 --delete-after was specified,"). I'm not positive, but this probably
 means query strings _do_ matter in that case. :p
 
 Confused? Coz I sure am!

 I had thought there was already an issue filed against this, but upon
 searching discovered I was thinking of a couple of related bugs that had
 been closed. I've filed a new issue for this:

 https://savannah.gnu.org/bugs/?22670


I'm not sure whether this post should go into the buglist discussion or
here, but I'll put it here.

I have to say, I'm not sure this is properly classed as a bug. If
accept/reject applies to the original URL filename, why should the code
bother to apply it again to the local name? If the URL filename doesn't
pass the filters, wget doesn't retrieve the file, so it can't save it. I
assume the answer was to handle script and content_disposition cases
where you don't know what you're going to get back. If you match only
on the URL, you'd have no way to control traversing separately from file
retention, and that's something you definitely want. (It's the default
for conventional html-based sites.) To put it another way, I usually
want to download all the php files, and traverse all that turn out to
be html, but I may only want to keep the zips or jpgs. With two
checks, one before download on the URL filename and another after
download on the local filename, I've got some control on cgi or php
script-based sites that is similar to the control on a conventional
html page site.

If this behavior is changed, then you'd probably need to have two sets
of accept/reject filters that could be defined separately, one set to
control traversing, and one to control file retention. I'd actually
prefer that, particularly with matching extended to the query string
portion of the URL. Right now, it may be impossible to prevent
traversing some links. If you don't want to traverse
"index.php?mode=logout", but do want to get "index.php?mode=getfile"
there's no way to do it since the URL filename is the same.
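
For illustration, here's a rough Python sketch (illustrative only, not
wget's code) of what a filter keyed to the URL filename portion sees
for those two links:

    from urllib.parse import urlparse
    import posixpath

    def url_filename(url):
        # the "filename portion" of a URL: last path segment, query ignored
        return posixpath.basename(urlparse(url).path)

    print(url_filename("http://site.com/index.php?mode=logout"))   # index.php
    print(url_filename("http://site.com/index.php?mode=getfile"))  # index.php

Both calls print the same name, so no filename-based pattern can accept
one link and reject the other.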

In the short term, it would help to add something to the documentation
in the accept/reject area, such as the following:

The accept/reject filters are applied to the
filename twice - once to the filename in the URL before downloading to
determine if the file should be retrieved (and parsed for more links if
it is determined after download to be an html file) and again to the
local filename after it is retrieved to determine if it should be
kept. The local filename after retrieval may be significantly
different from the URL filename before retrieval for many reasons.
These include:
1) The URL filename does not include any query string portion of the
URL, such as the string "?topic=16" in the URL
"http://site.com/index.php?topic=16". After download the file may be
stored as the local filename "index.php@topic=16". Accept/reject
matching does not apply to the URL query string portion before
download, but will apply after download when the query string is
incorporated into the local filename.
2) When content disposition is on, the local filename may be completely
different from the URL filename. The URL "index.php?getfile=21" may
return a content disposition header producing a local file of
"some_interesting_file.zip".
3) The -E (html extension) switch will alter the filename by adding an
.html suffix, and the -nd (no directories) switch will sometimes add a
numeric suffix such as .1 for duplicate files.
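
To make those cases concrete, here is a hedged Python sketch of how a
local name can diverge from the URL filename (a hypothetical helper
mirroring the cases above, assuming Windows-style '@' substitution; not
wget's actual code):

    from urllib.parse import urlparse
    import posixpath

    def local_name(url, content_disposition=None, html_extension=False):
        parts = urlparse(url)
        if content_disposition:        # case 2: server-supplied name wins
            return content_disposition
        name = posixpath.basename(parts.path)
        if parts.query:                # case 1: '?' is illegal on Windows,
            name += "@" + parts.query  # so '@' joins the query string
        if html_extension and not name.endswith((".htm", ".html")):
            name += ".html"            # case 3: -E appends .html
        return name

    print(local_name("http://site.com/index.php?topic=16"))
    # index.php@topic=16
    print(local_name("http://site.com/index.php?getfile=21",
                     content_disposition="some_interesting_file.zip"))
    # some_interesting_file.zip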

If the URL filenames of links found when the starting page is parsed do
not pass the accept/reject filters, the links will not be followed and
will not be parsed for more links unless the filename ends in .html or
.htm. If accept/reject filters are used on cgi, php, asp and similar
script-based sites, the URL filename must pass the filters (without
considering any query string portion) if the links are to be
traversed/parsed, and the local filename must pass the filters if the
retrieved files are to be retained.
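
Putting the two checks together, a minimal sketch (Python, assuming
wildcard-style matching throughout; the real matcher also does plain
suffix matching, and this is not the recur.c code):

    import fnmatch
    import posixpath
    from urllib.parse import urlparse

    ACCEPT = ["*.php", "*.zip"]
    REJECT = ["*logout*"]

    def passes(name):
        if any(fnmatch.fnmatch(name, p) for p in REJECT):
            return False
        return any(fnmatch.fnmatch(name, p) for p in ACCEPT)

    url = "http://site.com/index.php?topic=16"
    url_name = posixpath.basename(urlparse(url).path)  # check 1: pre-download
    if passes(url_name):                               # "index.php" passes
        local = "index.php@topic=16.html"              # name after -E renaming
        if not passes(local):                          # check 2: post-download
            print("Removing %s since it should be rejected." % local)

The URL filename passes the first check, but the renamed local file
fails the second, reproducing the "Removing ... since it should be
rejected" log line.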




Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-20 Thread Todd Pattist






  If we were going to leave this behavior in for some time, then I think
 it'd be appropriate to at least mention it (maybe I'll just mention it
 anyway, without a comprehensive explanation).

It would probably be sufficient to just add a very brief mention
to the docs of 1.11 of the two things that confused the heck out of me -
1) The accept/reject filters are applied twice, once to the URL
filename before retrieval and once to the local filename after
retrieval, and
2) A query string is not considered to be part of the URL filename.

You can probably imagine my confusion when I saw a filename with an
at-sign and query string being saved. I then tried to prevent that
link from being traversed
with a match on part of the query string, and I'd see that file
disappear, only to later realize it was traversed. I had no idea that
the query string was not being considered during the acc/rej match, nor
that the process was performed twice.

I look forward to 1.12.





Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-19 Thread Todd Pattist

Micah Cowan wrote:

 Well, -E is special, true. But in general the second quote is (by
 definition) correct.

 -E, obviously, _shouldn't_ be special...


I hope it's clear I'm not complaining.  Wget is great and your efforts 
are very much appreciated.  I just wanted to document the behavior I was 
seeing in a way that would help others.  I actually like the current 
behavior - now that I (more or less) understand it.  I can add php to the 
accept list, which controls traversing, and also optionally add html if 
I want to keep the html files.  If file retention was determined based 
solely on the URL, then traversal and local file retention would be 
inextricably linked.



I haven't yet quite figured out file extension matching versus string
matching in filenames, but extensions seem to match regardless of
leading characters or following ?id=1 parameters.


That's right; the query portion of the URL is not used to determine
matching. There are, of course, times when you specifically wish to tell
wget not to follow certain specific query strings (such as edit or print
or... in wikis); wget doesn't currently support this (I plan to fix this).


Now I'm confused again.  I suppose I can go through more trial and error
or dig through the source to figure out what it's really doing, but in
hopes you can throw more light on this, I'll explicate what is confusing
me. (Comments relate to wget 1.11 running on Windows XP.)


Confusion 1:  Right now, I'm only using file extensions in the accept= 
parameters, such as accept=zip,jpg,gif,php etc.  Even if the query 
portion (the ?id=1 part of site.com/index.php?id=1) is not considered 
during matching, it's not clear to me why accept=php matches 
site.com/index.php.  Why don't I need "*.php" (Windows) or "*php" 
(assuming the * glob matches the period)?  Would accept=index match 
index.php?id=1?  How about accept=*index*?  I assumed I could do an 
accept match on the query portion, the filename portion, or even the 
domain, but I suspect now that's wrong.  The domain gets stripped off 
when the local name is constructed, so I realize now I can't match on 
that (local filename used for matching), but the query portion is 
usually left as part of the filename, with an at-sign (@) replacing the 
question mark.  Is filename matching allowed, or only extension matching?


Confusion 2: I'm rejecting based on the query string, usually after an 
accept string allowing defined extensions.  I think I understand this, 
and I think it's working fine.  I'm usually doing something like 
reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout 
links or thread subscription links in a phpbb setting.  This works.  I 
think it's doing exactly what you say it's not yet capable of doing, but 
maybe I'm missing something.  Does the accept matching work differently 
from the reject matching?  Does reject work on the URL before retrieval, 
but accept work on the local filename after retrieval?  If the 
site.com/index.php?mode=logout link was being traversed with
accept=php and reject=*logout*, I would be getting logged out, but I'm not.

Hm.  Light perhaps begins to dawn.  It looks like both accept and 
reject are applied twice - once before retrieval and once after. To be 
retrieved/traversed it has to pass both filters and then after local 
renaming, it has to pass both again.  That would fit what I'm seeing. 
My reject filter prevents traversing logout links during the first pass 
and my accept filter deletes php files during the second check after 
html renaming.


Thanks for any comments or clarifications.



Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-19 Thread Micah Cowan

Todd Pattist wrote:
 Micah Cowan wrote:
 Well, -E is special, true. But in general the second quote is (by
 definition) correct.

 -E, obviously, _shouldn't_ be special...
 
 I hope it's clear I'm not complaining.

I didn't take it as complaining.

 I haven't yet quite figured out file extension matching versus string
 matching in filenames, but extensions seem to match regardless of
 leading characters or following ?id=1 parameters.

 That's right; the query portion of the URL is not used to determine
 matching. There are, of course, times when you specifically wish to tell
 wget not to follow certain specific query strings (such as edit or print
 or... in wikis); wget doesn't currently support this (I plan to fix
 this).
 
 Now I'm confused again.  I suppose I can go through more trial and error
  or dig through the source to figure out what it's really doing, but in
 hopes you can throw more light on this, I'll explicate what is confusing
 me. (comments relate to wget 1.11 running on Windows XP)
 
 Confusion 1:  Right now, I'm only using file extensions in the accept=
 parameters, such as  accept=zip,jpg,gif,php  etc.  Even if the query
 portion (the ?id=1 part of site.com/index.php?id=1) is not considered
 during matching, it's not clear to me why accept=php matches
 site.com/index.php.  Why don't I need "*.php" (Windows) or "*php"
 (assuming the * glob matches the period)?  Would accept=index match
 index.php?id=1?  How about accept=*index*?

(This is in the documentation; at least the full documentation. See the
manual on the website; I think the Windows Help files that ship with
Wget are based on a short version of the manual).

The way the matching works is that, if there are any wildcard characters
(any of '*', '?', '[' or ']'), then it is a wildcard pattern; otherwise,
it's matched exactly against the filename suffix (not necessarily the
extension). "php" will match "index.php", or even "shophp", but not
"index.php.foo". "*.php" wouldn't match "shophp", since the period is
right there.
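
In Python terms, roughly (fnmatch here is a stand-in for the POSIX
fnmatch() I believe Wget uses; treat this as a sketch, not the actual
code):

    import fnmatch

    def acceptable(name, pattern):
        if any(ch in pattern for ch in "*?[]"):   # wildcard pattern
            return fnmatch.fnmatch(name, pattern)
        return name.endswith(pattern)             # plain suffix match

    print(acceptable("index.php", "php"))       # True  (suffix match)
    print(acceptable("shophp", "php"))          # True  (suffix, not extension)
    print(acceptable("index.php.foo", "php"))   # False
    print(acceptable("shophp", "*.php"))        # False (the period must be there)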

This is only ever matched against the filename, and never the domain,
directory, or query string (actually, as you've discovered, it's matched
against the _local_ filename for some cases, which needs to be fixed).

As I currently understand it from the code, at least for Wget 1.11,
matching is against the _URL_'s filename portion (and only that portion:
no query strings, no directories) when deciding whether it should
download something through a recursive descent (the relevant spot in the
code is in recur.c, marked by a comment starting "6. Check for
acceptance/rejection rules.").

When deciding whether it should delete a file afterwards, however, it
uses the _local_ filename (relevant code also in recur.c, near "Either
--delete-after was specified,"). I'm not positive, but this probably
means query strings _do_ matter in that case. :p

Confused? Coz I sure am!

 I assumed I could do an
 accept match on the query portion, the filename portion, or even the
 domain, but I suspect now that's wrong.  The domain gets stripped off
 when the local name is constructed, so I realize now I can't match on
 that (local filename used for matching), but the query portion is
 usually left as part of the filename, with an at-sign (@) replacing the
 question mark.  Is filename matching allowed or only extension matching?

Well, there's a _separate_ option for matching/rejecting domain names
(which requires -H to be meaningful, since by default Wget only allows
hosts you've explicitly requested, plus any that result from redirections).

 Confusion 2: I'm rejecting based on the query string, usually after an
 accept string allowing defined extensions.  I think I understand this,
 and I think it's working fine.  I'm usually doing something like
 reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout
 links or thread subscription links in a phpbb setting.  This works.  I
 think it's doing exactly what you say it's not yet capable of doing, but
 maybe I'm missing something.  Does the accept matching work differently
 from the reject matching?

They use _exactly_ the same code.

 Does reject work on the URL before retrieval,
 but accept work on the local filename after retrieval?  If the
 site.com/index.php?mode=logout link was being traversed with
 accept=php and reject=*logout*, I would be getting logged out, but I'm not.

What site is it? You might run wget with --debug to find out _exactly_
why it doesn't traverse these (see
http://wget.addictivecode.org/FrequentlyAskedQuestions#not-downloading
for an enumeration of various messages Wget uses to say why something
isn't downloaded). Some sites are intelligent enough to include a
rel="nofollow" or "nofollow" attribute in their anchor tags, which Wget
will obey unless -e robots=off was specified. The MoinMoin wiki
software, for instance, will do this (which is what the Wget wiki runs on).


Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-19 Thread Micah Cowan

Micah Cowan wrote:
 As I currently understand it from the code, at least for Wget 1.11,
 matching is against the _URL_'s filename portion (and only that portion:
 no query strings, no directories) when deciding whether it should
 download something through a recursive descent (the relevant spot in the
 code is in recur.c, marked by a comment starting "6. Check for
 acceptance/rejection rules.").
 
 When deciding whether it should delete a file afterwards, however, it
 uses the _local_ filename (relevant code also in recur.c, near "Either
 --delete-after was specified,"). I'm not positive, but this probably
 means query strings _do_ matter in that case. :p
 
 Confused? Coz I sure am!

I had thought there was already an issue filed against this, but upon
searching discovered I was thinking of a couple of related bugs that had
been closed. I've filed a new issue for this:

https://savannah.gnu.org/bugs/?22670

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/


Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-10 Thread Micah Cowan

Todd Pattist wrote:
 I'm having trouble understanding how accept and reject work,
 particularly in the context of sites that rely on CGI and PHP to
 dynamically generate html pages.  My questions relate to the following:
 
 1) I don't fully understand the -A and -R effects and the difference, if
 any, between what links are traversed and parsed for deeper links,
 versus what files are kept and stored locally.  The docs seem to say
 that -A and -R have no effect on the link traverse for html files, but
 this doesn't seem true for dynamically generated CGI, PHP files.

This "doesn't affect traversal of HTML files" functionality is currently
implemented via a heuristic based on the filename extension. That is, if
it ends in ".htm" or ".html", I believe, then it will be traversed
regardless of -A or -R settings, whereas .cgi or .php will not affect
traversal.
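
In sketch form (my reading of that heuristic, not the actual recur.c
test):

    def traversed_regardless(url_filename):
        # HTML-suspected names are followed no matter what -A/-R say;
        # anything else is subject to the accept/reject rules.
        return url_filename.endswith((".htm", ".html"))

    print(traversed_regardless("index.html"))  # True
    print(traversed_regardless("view.php"))    # False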

I'd have to look at the relevant code, but it's possible that
directory-looking names may also be automatically traversed in that way.

 Does
 html_extension=on affect link traversal? 

No; this only affects whether filenames are changed upon download to
explicitly include an ".html" extension (useful for local browsing).

 I'd like to be able to
 independently control link traversal vs. file retrieval with local file
 storage.  Do the directory include/exclude commands allow this - do they
 work differently from -A -R?

I'm afraid I'm unsure what you are asking here.

 2) The logs seem to show PHP files being retrieved and then not saved.
 When mirroring a forum, you often want to exclude links that do a
 logout, or subscribe you to a topic.  Does -R prevent a dynamically
 generated html page from a PHP link from being traversed?

I think I'd need to see an example log of files "being retrieved and
then not saved", to understand what you mean.

 3) Which has priority if both reject and accept filters match?

Not sure; it's easy enough to test this yourself, though.

 4) Sometimes the OS restricts filename characters.  Do the -A and -R
 filters match on the final name used to store the file, or on the name
 at the server?

They should match the server's name (which includes the
Content-Disposition name, if that's being used); however, there were at
least some situations where the local name was being matched (there was
the case when -nd was being used, at least); I can't recall whether that
was resolved yet; I'm guessing not.

Please feel free to report any other cases you encounter, where local
transformations result in erroneous matches from -A/-R.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/


Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-10 Thread Todd Pattist




Thank you for the quick response. Background: I'm on Windows XP, GNU
wget 1.11.

  This "doesn't affect traversal of HTML files" functionality is currently
implemented via a heuristic based on the filename extension. That is, if
it ends in ".htm" or ".html", I believe, then it will be traversed
regardless of -A or -R settings, whereas .cgi or .php will not affect
traversal.
  

I'm not sure I understand the ".cgi or .php will not affect traversal."
If I use wget to start at http://site.com/view.php?f=16 and recursively
mirror without -A or -R, it looks like it traverses deeper as though
that page and other .php links are html files. This makes sense. (I say
"looks like", because it takes a long time and produces lots of files).
If I select the same page and add accept=site.com/view.php?id=16 to
wgetrc, no pages are saved and it does not traverse any deeper and it
takes only a second or two. I see this in the log:

Saving to: `site.com/view.php@id=16.html'
Removing site.com/view.php@id=16.html since it should be rejected.

I recognize that the question mark was substituted with an at-sign for
my OS, but that does not matter to the accept filter. What does matter
is whether I have the .html or not in the accept filter. That surprises
me. Both accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16*
will match and keep the site.com/view.php@id=16.html file, while both
accept=site.com/view.php?id=16 and accept=site.com/view.php@id=16 cause
it not to match and generate the "Removing ... since it should be
rejected" line. Regardless of the matching/saving, this seems to
control traversal, as I get far deeper traversal with no accept= at all.

I'm pretty sure I can control traversal of php links with accept and
reject, but I often want to traverse looking for certain file types,
but don't want to save all the php files traversed.


  
I'd have to look at the relevant code, but it's possible that
"directory"-looking names may also be automatically traversed in that way.
  

I don't want you to do work I can do myself. I was just hoping for a
link or some pointers that might help.


  
  Does
  html_extension=on affect link traversal?

 No; this only affects whether filenames are changed upon download to
 explicitly include an ".html" extension (useful for local browsing).


It seems that the html extension is used in the filter matching of
accept/reject, and that seems to affect traversal as described above
unless I'm missing something (which is entirely possible).

  
  I'd like to be able to
  independently control link traversal vs. file retrieval with local file
  storage.  Do the directory include/exclude commands allow this - do they
  work differently from -A -R?

 I'm afraid I'm unsure what you are asking here.

Is my question clearer from the above? I'm seeing very quick exits
(seconds) when the accept filter does not match the start page. To get
deeper traversing, I have to match, but then it saves the matched files
and the traverse takes hours, with perhaps thousands of html files
(converted from .php files), none of which I need. 


  
  2) The logs seem to show PHP files being retrieved and then not saved.
  When mirroring a forum, you often want to exclude links that do a
  logout, or subscribe you to a topic.  Does -R prevent a dynamically
  generated html page from a PHP link from being traversed?

 I think I'd need to see an example log of files "being retrieved and
 then not saved", to understand what you mean.

I put a log of this type above. By adjusting accept and reject, I can
exclude traversing a logout .php link (which I want to do), but I can't
seem to traverse links I want to traverse without also saving them
locally. It's not critical to resolve this for me, as I can always
delete what I don't want, but it is confusing. I wanted to make sure I
wasn't missing something.

  
  3) Which has priority if both reject and accept filters match?

 Not sure; it's easy enough to test this yourself, though.

I have done lots of testing, so you'd think this simple one would be
obvious. The answer seems to be that reject is higher priority, since
identical accept= and reject= seem to produce no output. This matches
what the manual says. It might help to add to the manual that adding
an accept= filter causes a rejection of everything that does not match
the accept filter, even if there is no reject filter specified. The
fact that specifically accepting some files turns on a default
rejection of everything else surprised me, since the normal default is
to accept everything.
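
In other words, something like this hedged sketch (Python for
illustration, treating every pattern as a wildcard; not wget's code):

    import fnmatch

    def kept(name, accept, reject):
        if any(fnmatch.fnmatch(name, p) for p in reject):
            return False    # reject wins when both match
        if accept:          # a non-empty accept list rejects everything else
            return any(fnmatch.fnmatch(name, p) for p in accept)
        return True         # no accept list: the default is to keep

    print(kept("notes.txt", ["*.zip"], []))  # False - implicit rejection
    print(kept("notes.txt", [], []))         # True  - normal default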

As a matter of interest, httrack uses the opposite logic. Adding a
specific accept in httrack has no effect if there is no reject. Thus,
the most common format is to reject everything followed by a list of
filetypes to accept. The wget procedure is more efficient since you
don't need the starting "reject everything," and why would you accept
if you didn't want to reject something else, but it 

Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-10 Thread Micah Cowan

Todd Pattist wrote:
 Thank you for the quick response.  Background is I'm on Windows XP, Gnu
 wget 1.11
 This "doesn't affect traversal of HTML files" functionality is currently
 implemented via a heuristic based on the filename extension. That is, if
 it ends in ".htm" or ".html", I believe, then it will be traversed
 regardless of -A or -R settings, whereas .cgi or .php will not affect
 traversal.

 I'm not sure I understand the ".cgi or .php will not affect traversal." 

I mean, it will not detect these as HTML files, so the accept/reject
rules will be applied to them without exception.

 If I use wget to start at http://site.com/view.php?f=16 and recursively
 mirror without -A or -R, it looks like it  traverses deeper as though
 that page and other .php links are html files. This makes sense. (I say
 "looks like", because it takes a long time and produces lots of files). 
 If I select the same page and add  accept=site.com/view.php?id=16 to
 wgetrc, no pages are saved and it does not traverse any deeper and it
 takes only a second or two.  I see this in the log:
 
 Saving to: `site.com/view.php@id=16.html'
 Removing site.com/view.php@id=16.html since it should be rejected.
 
 I recognize that the question mark was substituted with an at-sign for
 my OS, but that does not matter to the accept filter.  What does matter
 is whether I have the .html or not in the accept filter.  That surprises
 me.  Both accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16*
 will match and keep the
 site.com/view.php@id=16.html file, while both
 accept=site.com/view.php?id=16 and accept=site.com/view.php@id=16 cause
 it not to match and generate the "Removing ... since it should be
 rejected" line.  Regardless of the matching/saving this seems to control
 traversal, as I get far deeper traversal with no accept= at all.

After another look at the relevant portions of the source code, it looks
like accept/reject rules are _always_ applied against the local
filename, contrary to what I'd been thinking. This needs to be changed.
(But it probably won't be, any time soon.)

Note that the view.php?id=16 doesn't mean what you may perhaps think it
does: Wget detects the "?" as a wildcard, and allows it to match any
character (including "@"). If you supplied "\?" instead (which matches a
literal question mark), I'm guessing it'd actually fail to match,
because it's checking against "@".
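
A quick illustration (Python's fnmatch shown as a stand-in; note it has
no backslash escape, so a "[?]" character class plays the role of "\?"
here):

    import fnmatch

    # '?' is a single-character wildcard, so it happily matches the '@':
    print(fnmatch.fnmatch("view.php@id=16", "view.php?id=16"))    # True

    # a literal question mark does not:
    print(fnmatch.fnmatch("view.php@id=16", "view.php[?]id=16"))  # False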

My understanding is that, when you specify a URL directly at the
command-line, it will be downloaded and traversed (if it turns out to be
HTML), no matter what the accept/reject rules are (which can still cause
it to be removed afterwards). Therefore, I suspect that what Wget does
with your URL when it isn't matching the accept rules is:

  1. Downloads the named file
  2. Discovers that, regardless of the filename, it is indeed an HTML
file, so scans it for all links to be downloaded.
  3. After scanning for all the links, it doesn't find any that end in
.html, nor any that match the accept rules, so it doesn't do anything
else.

--debug will definitely tell you whether it's bothering to scan that
first file or not, and what it decides to do with the links it finds.

 I'm pretty sure  I can control traversal of php links with accept and
 reject, but I often want to traverse looking for certain file types, but
 don't want to save all the php files traversed.

We're looking for more fine-grained controls to allow this sort of
thing, but at the moment, my understanding is that there is no control
over whether Wget traverses-and-then-deletes a given file: it will
_always_ do that for files it knows or suspects are HTML (based on .htm,
.html suffixes, or if, like the above example, it will download the
filename first anyway because it's an explicit command-line argument);
it will _never_ download/traverse any other sorts of links that do not
match the accept rules.

If something _does_ match the accept rules, and turns out after download
to be an HTML file (determined by the server's headers), it will
traverse it further; but of course it won't delete it afterward,
because it matched the accept list.
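
As a hedged summary of those rules in code form (my paraphrase of the
behavior described above, not the implementation):

    def action(html_suspected, matches_rules):
        # html_suspected: .htm/.html suffix, or an explicit command-line URL
        if html_suspected:
            return ("download, traverse, keep" if matches_rules
                    else "download, traverse, then delete")
        if matches_rules:
            return "download, keep; traverse if headers say it's HTML"
        return "skip entirely"

    print(action(True, False))   # download, traverse, then delete
    print(action(False, True))   # download, keep; traverse if HTML
    print(action(False, False))  # skip entirely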

 I'd have to look at the relevant code, but it's possible that
 directory-looking names may also be automatically traversed in that way.
   
 I don't want you to do work I can do myself.  I was just hoping for a
 link or some pointers that might help.

It looks like this idea was incorrect anyway; it's only based on the suffix.

 Does
 html_extension=on affect link traversal? 
 

 No; this only affects whether filenames are changed upon download to
 explicitly include an ".html" extension (useful for local browsing).
   
 
 It seems that the html extension is used in the filter matching of
 accept/reject, and that seems to affect traversal as described above
 unless I'm missing something (which is entirely possible).

Yes, it does; my bad.


Re: Accept and Reject - particularly for PHP and CGI sites

2008-03-10 Thread Todd Pattist
This cleared up a lot.  I really appreciate your reply.  I've been using 
the log and the server_response = on
parameters, but not --debug.  I'll add that now and take a look, but 
your 1..2..3.. answer below and the comment that accept/reject matching 
is on the local filename explain what I'm seeing.  From your comments, 
I'm confident I can get it to do what I want, with the only problem 
being that I'll have to delete excess files.  That's not really a 
problem for me, as long as I understand what it is doing and why.

