Dear wget community,

I'm playing with wget's mirroring functionality for the first time, and so far it's fantastic. Thanks for the great work!
I'm using a command like the following to create a (shallow) offline mirror of my Blogger blog:

    wget --tries=2 -e robots=off --span-hosts --timestamping --recursive --level=2 --no-remove-listing --adjust-extension --convert-links --page-requisites <MY_BLOG_URL>

Unfortunately the blog has some comment spam, and wget is dutifully mirroring the spammers' pages, which are linked to from the comments. It occurs to me that it could be useful to tell wget to ignore the comments sections of pages altogether. Is something like that possible? I looked through the documentation and found only --exclude-domains, which helps only when you already know which domains you don't want.

I imagine an option like --exclude-crawling-within=<CSS_SELECTOR> could accomplish this: wget would ignore any DOM subtrees matching the provided CSS selector (e.g. "#comments" in this case).

Even more general would be something like --next-urls-cmd=<CMD>, where you supply a command that accepts an HTTP response on stdin and writes to stdout the set of URLs that should be crawled from it. wget could consult this command in --recursive mode to allow more customizable crawling behavior. This leaves any HTML parsing or regular-expression matching entirely up to the user.

Is there any interest in this? Is it feasible?

Thanks, and thanks again for the great work on wget.
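P.S. To make the --next-urls-cmd idea concrete, here's a rough sketch (in Python, using only the standard library) of what such a filter command could look like. Everything here is hypothetical: the option name, the script, and the assumption that wget would pass just the HTML body on stdin. It collects href/src URLs while skipping the subtree of the element with id="comments":

```python
#!/usr/bin/env python3
"""Hypothetical filter for the proposed --next-urls-cmd option.

Reads an HTML document on stdin and writes, one URL per line, the
links wget should crawl next -- skipping anything found inside the
DOM subtree of the element with id="comments".
"""
import sys
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not
# affect our skip-depth bookkeeping.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class LinkFilter(HTMLParser):
    """Collect href/src URLs, ignoring the id="comments" subtree."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside the comments subtree
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.skip_depth:
            # Track nesting so we know when the subtree ends.
            if tag not in VOID:
                self.skip_depth += 1
            return
        if attrs.get("id") == "comments":
            self.skip_depth = 1
            return
        for key in ("href", "src"):
            if key in attrs:
                self.urls.append(attrs[key])

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1


if __name__ == "__main__":
    parser = LinkFilter()
    parser.feed(sys.stdin.read())
    for url in parser.urls:
        print(url)
```

A CSS-selector version (--exclude-crawling-within) would be the same idea, just matching the selector instead of a hard-coded id.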
