Follow-up Comment #5, bug #20398 (project wget): I've found myself in need of this feature. I'm trying to download a website recursively without pulling in every single ad and its HTML. I'd like to be able to find out which URLs were rejected, why they were rejected, and details about each one (host, port, etc.).
I've patched my copy of Wget to dump all of this into a CSV file, which I can then tool through to get my desired results:

    % grep "DOMAIN" rejected.csv | head -1
    DOMAIN,http://c0059637.cdn1.cloudfiles.rackspacecloud.com/flowplayer-3.2.6.min.js,SCHEME_HTTP,c0059637.cdn1.cloudfiles.rackspacecloud.com,80,flowplayer-3.2.6.min.js,(null),(null),(null),http://redacted/,SCHEME_HTTP,redacted,80,,(null),(null),(null)

    % grep "DOMAIN" rejected.csv | cut -d"," -f4 | sort | uniq
    0.gravatar.com
    1.gravatar.com
    c0059637.cdn1.cloudfiles.rackspacecloud.com
    lh3.googleusercontent.com
    lh4.googleusercontent.com
    lh5.googleusercontent.com
    lh6.googleusercontent.com

I've included a patch, put together in a few hours, that does this. (file #33955)

_______________________________________________________

Additional Item Attachment:

File name: 0001-rejected-log-Add-option-to-dump-URL-rejections-to-a-.patch
Size: 14 KB

_______________________________________________________
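For anyone curious what the logging side of such a patch might look like, here is a minimal sketch in C. This is not the actual patch: the struct, the field names, and the dump_rejection function are my own assumptions, loosely modelled on the CSV columns visible in the sample output above.

    #include <stdio.h>

    /* Hypothetical record for one rejected URL; the fields mirror the
       first few CSV columns in the sample output (reason, URL, scheme,
       host, port, file).  Names are illustrative, not from the patch.  */
    struct rejected_url
    {
      const char *reason;   /* e.g. "DOMAIN" */
      const char *url;
      const char *scheme;   /* e.g. "SCHEME_HTTP" */
      const char *host;
      int port;
      const char *file;
    };

    /* Write one CSV row, printing "(null)" for missing string fields,
       matching the convention in the sample output.  */
    static void
    dump_rejection (FILE *fp, const struct rejected_url *r)
    {
      fprintf (fp, "%s,%s,%s,%s,%d,%s\n",
               r->reason ? r->reason : "(null)",
               r->url    ? r->url    : "(null)",
               r->scheme ? r->scheme : "(null)",
               r->host   ? r->host   : "(null)",
               r->port,
               r->file   ? r->file   : "(null)");
    }

    int
    main (void)
    {
      /* Illustrative data only, echoing a host from the output above.  */
      struct rejected_url r = {
        "DOMAIN", "http://0.gravatar.com/avatar.png",
        "SCHEME_HTTP", "0.gravatar.com", 80, "avatar.png"
      };
      dump_rejection (stdout, &r);
      return 0;
    }

The appeal of a flat one-row-per-rejection CSV, as the grep/cut pipeline above shows, is that standard Unix tools are enough to answer questions like "which hosts were rejected and why" without any further tooling.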
