Follow-up Comment #5, bug #20398 (project wget): I've found myself in need of this feature. I'm trying to download a website recursively without pulling in every single ad and its HTML. I'd like to be able to find out which URLs were rejected, why they were rejected, and details about each one (host, port, etc.).
I've patched my copy of Wget to dump all of this into a CSV file, which I can then tool through to get my desired results:

    % grep "DOMAIN" rejected.csv | head -1
    DOMAIN,http://c0059637.cdn1.cloudfiles.rackspacecloud.com/flowplayer-3.2.6.min.js,SCHEME_HTTP,c0059637.cdn1.cloudfiles.rackspacecloud.com,80,flowplayer-3.2.6.min.js,(null),(null),(null),http://redacted/,SCHEME_HTTP,redacted,80,,(null),(null),(null)

    % grep "DOMAIN" rejected.csv | cut -d"," -f4 | sort | uniq
    0.gravatar.com
    1.gravatar.com
    c0059637.cdn1.cloudfiles.rackspacecloud.com
    lh3.googleusercontent.com
    lh4.googleusercontent.com
    lh5.googleusercontent.com
    lh6.googleusercontent.com

I've included a patch, put together in a few hours, that does this. (file #33955)

_______________________________________________________

Additional Item Attachment:

File name: 0001-rejected-log-Add-option-to-dump-URL-rejections-to-a-.patch
Size: 14 KB

_______________________________________________________
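For anyone curious what the logging side of such a patch might look like, here is a minimal sketch in C. This is not the actual patch: the struct, the field names, and the dump_rejection function are my own assumptions, loosely modelled on the CSV columns visible in the sample output above.

    #include <stdio.h>

    /* Hypothetical record for one rejected URL; the fields mirror the
       first few CSV columns in the sample output (reason, URL, scheme,
       host, port, file).  Names are illustrative, not from the patch.  */
    struct rejected_url
    {
      const char *reason;   /* e.g. "DOMAIN" */
      const char *url;
      const char *scheme;   /* e.g. "SCHEME_HTTP" */
      const char *host;
      int port;
      const char *file;
    };

    /* Write one CSV row, printing "(null)" for missing string fields,
       matching the convention in the sample output.  */
    static void
    dump_rejection (FILE *fp, const struct rejected_url *r)
    {
      fprintf (fp, "%s,%s,%s,%s,%d,%s\n",
               r->reason ? r->reason : "(null)",
               r->url    ? r->url    : "(null)",
               r->scheme ? r->scheme : "(null)",
               r->host   ? r->host   : "(null)",
               r->port,
               r->file   ? r->file   : "(null)");
    }

    int
    main (void)
    {
      /* Illustrative data only, echoing a host from the output above.  */
      struct rejected_url r = {
        "DOMAIN", "http://0.gravatar.com/avatar.png",
        "SCHEME_HTTP", "0.gravatar.com", 80, "avatar.png"
      };
      dump_rejection (stdout, &r);
      return 0;
    }

The appeal of a flat one-row-per-rejection CSV, as the grep/cut pipeline above shows, is that standard Unix tools are enough to answer questions like "which hosts were rejected and why" without any further tooling.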
