Re: Accept and Reject - particularly for PHP and CGI sites

Micah Cowan Thu, 20 Mar 2008 09:07:44 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Todd Pattist wrote:
>>
>> > When deciding whether it should delete a file afterwards, however, it
>> > uses the _local_ filename (relevant code also in recur.c, near "Either
>> > --delete-after was specified,"). I'm not positive, but this probably
>> > means query strings _do_ matter in that case. :p
>> > 
>> > Confused? Coz I sure am!
>>
>> I had thought there was already an issue filed against this, but upon
>> searching discovered I was thinking of a couple of related bugs that had
>> been closed. I've filed a new issue for this:
>>
>> https://savannah.gnu.org/bugs/?22670
> 
> I'm not sure whether this post should go into the buglist discussion or
> here, but I'll put it here.
> 
> I have to say, I'm not sure this is properly classed as a bug.  If
> accept/reject applies to the original URL filename, why should the code
> bother to apply it again to the local name?  If filters don't pass the
> URL filename and wget doesn't retrieve the file, it can't save it.  I
> assume the answer was to handle script and content_disposition cases
> where you don't know what you're going to get back.


Either way, that still sounds like a bug to me.

The reason is that .html/.htm files are always downloaded, no matter
what the accept/reject rules are, so that Wget can process them for
links; it then checks each one after downloading to see whether it
should be deleted.

As you point out, it will have already been checked, so perhaps setting
a flag at that point that can be checked later, rather than doing the
whole check again, may be better.
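
To make the flag idea concrete, here is a minimal sketch in Python (not
Wget's actual C code). The Candidate class, the passed_filters field, and
the function names are all hypothetical, and shell-style matching via
fnmatch is assumed as a stand-in for Wget's own matcher.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from urllib.parse import urlsplit
import posixpath


@dataclass
class Candidate:
    url: str
    # Hypothetical flag: recorded once, when the URL is first checked.
    passed_filters: bool = False


def url_filename(url):
    """Last path component of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


def check_before_download(cand, accept_patterns):
    """Run the accept check once on the URL filename and remember the result.

    Returns True if the URL should be fetched at all.
    """
    name = url_filename(cand.url)
    cand.passed_filters = any(fnmatch(name, p) for p in accept_patterns)
    # .html/.htm is fetched regardless, so it can be parsed for links.
    return cand.passed_filters or name.endswith((".html", ".htm"))


def should_delete_after_download(cand):
    """Reuse the remembered result instead of re-matching the local filename."""
    return not cand.passed_filters
```

With accept patterns of ["*.zip"], an index.html link would be fetched for
parsing and then deleted, while a .zip link would be fetched and kept,
whatever name it ends up with on disk.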

That won't deal with Content-Disposition properly, but actually I have
my doubts about whether we really want to be doing name-based
accept/rejects on Content-Disposition. Should probably allow it as an
option, but off by default; most people really want to prevent Wget from
traversing certain URL patterns, and not actual "suggested names" of files.

> If you match only
> on URL, you'd have no way to control traversing separate from file
> retention, and that's something you definitely want.  (It's the default
> for conventional html based sites.) To put it another way, I usually
> want to download all the php files, and traverse all that turn out to be
> html, but I may only want to keep the zips or jpgs.  With two checks,
> one before download on the URL filename and another after download on
> the local filename, I've got some control in cgi, php script based sites
> that is similar to the control in a conventional html page site.

Agreed, and there's been discussion on that before. That's a separate
feature, though. When it's time to implement that feature, we'll do so.
Even so, I'm not sure we want to do that match on the actual local
filename: we probably still want to do it on the URL path, otherwise
we'll still match things like .html from -E, or number suffixes from -nd.

> If this behavior is changed, then you'd probably need to have two sets
> of accept/reject filters that could be defined separately, one set to
> control traversing, and one to control file retention.  I'd actually
> prefer that, particularly with matching extended to the query string
> portion of the URL.  Right now, it may be impossible to prevent
> traversing some links.  If you don't want to traverse
> "index.php?mode=logout", but do want to get "index.php?mode=getfile"
> there's no way to do it since the URL filename is the same.
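
The limitation described above can be illustrated with a short Python
sketch. It is not how Wget is implemented; filename_with_query and
rejected are hypothetical helpers, and fnmatch stands in for Wget's own
pattern matching. The point is only that the two URLs are
indistinguishable once the query string is dropped, and distinguishable
once it is kept.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit
import posixpath


def url_filename(url):
    """Last path component only; the query string is discarded."""
    return posixpath.basename(urlsplit(url).path)


def filename_with_query(url):
    """Hypothetical alternative: keep the query string for matching."""
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)
    return f"{name}?{parts.query}" if parts.query else name


def rejected(url, reject_patterns):
    """True if any reject pattern matches the filename plus query string."""
    name = filename_with_query(url)
    return any(fnmatch(name, p) for p in reject_patterns)


logout = "http://site.com/index.php?mode=logout"
getfile = "http://site.com/index.php?mode=getfile"

# Without the query string, both links reduce to the same filename.
same_name = url_filename(logout) == url_filename(getfile)
```

Here same_name is True, so no filename-only pattern can separate the two
links, whereas rejected(url, ["*mode=logout*"]) can.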

Yup. That peeves me. It's scheduled for 1.12, IIRC.

> In the short term, it would help to add something to the documentation
> in the accept/reject  area, such as the following:
> 
> The accept/reject filters are applied to the filename twice - once to
> the filename in the URL before downloading to determine if the file
> should be retrieved (and parsed for more links if it is determined after
> download to be an html file) and again to the local filename after it is
> retrieved to determine if it should be kept.  The local filename after
> retrieval may be significantly different from the URL filename before
> retrieval for many reasons.  These include:
> 1) The URL filename does not include any query string portion of the
> URL, such as the string "?topic=16" in the URL
> "http://site.com/index.php?topic=16". After download the file may be
> stored as the local filename "[EMAIL PROTECTED]".  Accept/reject
> matching does not apply to the URL query string portion before download,
> but will apply after download when the query string is incorporated into
> the local filename.
> 2) When content disposition is on, the local filename may be completely
> different from the URL filename.  The URL "index.php?getfile=21" may
> return a content disposition header producing a local file of
> "some_interesting_file.zip".
> 3) The -E (html extension) and sometimes the -nd (no directories) 
> switches will alter the filename suffix by adding .html or .1 for
> duplicate files.
> 
> If the URL filenames in links found when the starting page is parsed do
> not pass the accept/reject filters, the links will not be followed and
> will not be parsed for more links unless the filename ends in html or htm.
> If accept/reject filters are used on cgi, php, asp and similar script
> based sites the URL filename must pass the filters (without considering
> any query string portion) if the links are to be traversed/parsed, and
> the local filename must pass the filters if the retrieved files are to
> be retained.
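
The two-pass behavior described in the proposed text can be sketched as
follows, in Python rather than Wget's C. The decide function and its
arguments are hypothetical; both names are passed in by hand, since how
the local filename is derived is exactly the part that varies. Only
accept patterns are modeled, with fnmatch assumed as the matcher.

```python
from fnmatch import fnmatch


def passes(name, accept):
    """True if the name matches at least one accept pattern."""
    return any(fnmatch(name, p) for p in accept)


def decide(url_name, local_name, accept):
    """Return (retrieve, keep) for one link.

    First pass: the URL filename (no query string) decides retrieval,
    except that .html/.htm names are always fetched for link parsing.
    Second pass: the local filename decides whether the file is kept.
    """
    is_html_name = url_name.endswith((".html", ".htm"))
    retrieve = passes(url_name, accept) or is_html_name
    keep = retrieve and passes(local_name, accept)
    return retrieve, keep
```

For a link whose URL filename is index.php and whose local filename is
some_interesting_file.zip, accepting only "*.zip" never retrieves it,
accepting only "*.php" retrieves it and then deletes it, and accepting
both keeps it.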

I'm not sure that describing a bug at length is really appropriate for
the manual, though it would probably reduce some confusion (except that
the process itself is still confusing to read). Better just to fix the
way Wget works, and document that.

If we were going to leave this behavior in for some time, then I think
it'd be appropriate to at least mention it (maybe I'll just mention it
anyway, without a comprehensive explanation); but since I'm planning on
fixing the more seriously broken aspects of it in 1.12, it's probably
not worth a lengthy explanation.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH4ouq7M8hyUobTrERAocYAJ9sXsHadWB9MPKeZxfmY5/WvK9vagCcCYPo
NkHJaqCZRBwxsxeHCIV/nGI=
=qcOO
-----END PGP SIGNATURE-----
