Dear wget community,

I'm playing with wget's mirroring functionality for the first time, and so far it's fantastic. Thanks for the great work!
I'm using a command like the following to create a (shallow) offline mirror of my Blogger blog:

    wget --tries=2 -e robots=off --span-hosts --timestamping --recursive --level=2 --no-remove-listing --adjust-extension --convert-links --page-requisites <MY_BLOG_URL>

Unfortunately the blog has some comment spam, and wget is dutifully mirroring the spammers' pages, which are linked to from the comments. It occurs to me that it could be useful to tell wget to ignore the comments sections of pages altogether. Is something like that possible? I looked through the documentation and found only --exclude-domains, which helps only when you already know which domains you don't want.

I imagine an option like --exclude-crawling-within=<CSS_SELECTOR> could accomplish this: wget would ignore any DOM subtrees matching the provided CSS selector (e.g. "#comments" in this case).

Even more general would be something like --next-urls-cmd=<CMD>, where you supply a command that accepts an HTTP response on stdin and writes to stdout the set of URLs that should be crawled from it. wget could consult this command in --recursive mode to allow more customizable crawling behavior. This leaves any HTML parsing or regular-expression matching entirely up to the user.

Is there any interest in this? Is it feasible?

Thanks, and thanks again for the great work on wget.
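P.S. To make the --next-urls-cmd idea concrete, here's a rough sketch (in Python, using only the standard library) of what such a filter command could look like. Everything here is hypothetical: the option name, the script, and the assumption that wget would pass just the HTML body on stdin. It collects href/src URLs while skipping the subtree of the element with id="comments":

```python
#!/usr/bin/env python3
"""Hypothetical filter for the proposed --next-urls-cmd option.

Reads an HTML document on stdin and writes, one URL per line, the
links wget should crawl next -- skipping anything found inside the
DOM subtree of the element with id="comments".
"""
import sys
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not
# affect our skip-depth bookkeeping.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class LinkFilter(HTMLParser):
    """Collect href/src URLs, ignoring the id="comments" subtree."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside the comments subtree
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.skip_depth:
            # Track nesting so we know when the subtree ends.
            if tag not in VOID:
                self.skip_depth += 1
            return
        if attrs.get("id") == "comments":
            self.skip_depth = 1
            return
        for key in ("href", "src"):
            if key in attrs:
                self.urls.append(attrs[key])

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1


if __name__ == "__main__":
    parser = LinkFilter()
    parser.feed(sys.stdin.read())
    for url in parser.urls:
        print(url)
```

A CSS-selector version (--exclude-crawling-within) would be the same idea, just matching the selector instead of a hard-coded id.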
