Hello wget list,

A question that has been bugging me for quite some time…

If a site has a large number of hotlinked images, videos, etc., how could
one perform an infinite recursive crawl that also grabs that hotlinked
content, without invoking -H, which would pull in unwanted material and in
some cases get completely out of control?

Heritrix has an option for this:
https://webarchive.jira.com/wiki/display/Heritrix/unexpected+offsite+content

Httrack has an option, using the --near flag:
http://www.httrack.com/html/fcguide.html
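The closest I've come with wget alone is a two-stage crawl: mirror the site
without host spanning, then extract the offsite media URLs from the saved
HTML and fetch just those. A rough sketch (example.org, the file extensions,
and the src-attribute regex are all my own assumptions; a real run would
want a stricter extractor):

```shell
# Stage 1: mirror the site itself, staying on the original host.
wget --recursive --level=inf --no-parent http://example.org/

# Stage 2: pull hotlinked media URLs out of the saved HTML and fetch
# only those files, so the crawl never follows offsite pages.
grep -rhoE 'src="https?://[^"]+\.(jpe?g|png|gif|mp4)"' example.org \
  | sed 's/^src="//; s/"$//' \
  | sort -u > hotlinked.txt
wget --input-file=hotlinked.txt --directory-prefix=hotlinked/
```

It works, but it misses media referenced outside src attributes and it's
two passes instead of one, which is why a built-in option would be nicer.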

This is essentially the only thing preventing me from using wget alone for
my web archiving needs… am I missing something?

Thanks,
Ben
