On 06/07/2010 01:27 PM, Tony Lewis wrote:
> Micah Cowan wrote:
>
>> For some value of "quickly". This obviously necessitates extra
>> round-trips to the server. Can still be useful, but still perhaps not as
>> useful as doing URL-matching properly.
>
> I would prefer an extra round trip to avoid downloading a 2GB file that will
> immediately be ignored and deleted.
Such files very rarely end in ".html" AFAICT, but okay, sure. As I said, it's useful, but the extra round trip would need to be clearly documented, and it would be particularly effective when paired with something that prevents even the first round trip when it's unnecessary.

> While I would love to see proper URL matching in wget, I don't think that
> solves the problem for this use case. I think we want to parse all
> text/html regardless of the URL.

I don't think the second sentence follows from the first particularly well. Just because one wants to control downloads by content-type does not imply that one doesn't also want to control them by URL. I realize that there are cases where one wants to trawl an entire website looking to keep only specific types, and in that case content-type matching fits the bill. But I've never personally been in that situation. The closest I've come is situations where I want to trawl some _portion_ of a website, in which case I want both content-type matching _and_ better URL matching, which is why I said it works best when _combined_ with URL matching.

And again, this is _particularly_ the case because I rarely encounter a site where content-types can be used effectively in a way that URL matching could not have done better (without extra round trips). When sites don't advertise content types via extensions, it's most often because the content is hidden behind CGI scripts or the like (foo.php?filename=file-i-don't-want-to-download.wmv), and those rarely respond correctly to HEAD requests. This type of URL is a great example because it won't be solved by wget's current leave-off-the-query-string matching behavior, nor by checking HEAD's content-type; the only way to avoid downloading it is improved URL matching.
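To make the layering concrete, here's a rough sketch in Python. It's purely illustrative -- neither function corresponds to anything wget actually implements, and the names, patterns, and reject lists are invented for the example. The point is that a full-URL match (query string included) costs no round trips at all, while a content-type check needs at least a HEAD round trip and only helps when the server answers HEAD correctly:

```python
import re

def url_allowed(url, reject_patterns):
    """Match patterns against the FULL URL, query string included.
    Matching that strips the query string would let
    foo.php?filename=clip.wmv through; this rejects it before any
    request (HEAD or GET) is ever made."""
    return not any(re.search(p, url) for p in reject_patterns)

def content_type_allowed(content_type_header, reject_types):
    """Second line of defense: check the Content-Type returned by a
    HEAD request (one extra round trip) before issuing the real GET.
    Parameters like '; charset=UTF-8' are stripped first."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type not in reject_types

reject_urls = [r"\.wmv($|&)", r"action=(logout|delete)"]
reject_types = {"video/x-ms-wmv", "application/octet-stream"}

# The URL filter catches this with no round trip at all:
print(url_allowed("http://example.org/foo.php?filename=clip.wmv",
                  reject_urls))                                    # False
# The content-type filter catches servers that hide types behind CGI
# URLs -- but only if they respond to HEAD correctly:
print(content_type_allowed("video/x-ms-wmv", reject_types))        # False
print(content_type_allowed("text/html; charset=UTF-8", reject_types))  # True
```

The filters are complementary, which is the point about combining them: the URL filter is free but depends on the URL carrying a usable signal, while the content-type filter works on opaque URLs but pays a round trip and trusts the server's HEAD handling.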
It _could_ be solved by terminating the connection when we see that the body of a GET response has a "reject" content-type, but that is not efficient behavior (especially when we're proxied or using NTLM), and it could contribute to loading the server unnecessarily if we're repeatedly asking for things we don't intend to accept (though one might argue they get what's coming to them for not supplying a proper HEAD response :) ).

In addition, failing to provide proper URL matching means that Wget behaves completely inappropriately on CMS-style sites, wikis in particular. Wget currently has no means of distinguishing (and avoiding) pages like page.php?action=logout, page.php?action=delete, or page.php?perform-some-cpu-intensive-transformation-to-pdf-or-whatnot, which is a pretty major gap. Most of the major wikis supply robots rules that prevent this, but not all, and those robots rules may contain other, less appropriate bans, because after all they're intended for robots, not user agents.

-- 
Micah J. Cowan
http://micah.cowan.name/
