Hi,

I have a similar issue. I'm using wget recursively as a link-checking spider. I don't save the downloaded files, so the -c and -N options won't help me. What I'd love is for wget to keep a list of the links it has followed and skip any link already on that list. As it is, I download 250K links, of which only 70K are unique. I'm thinking this is a feature request, but if there's a way I can cut down on the extra downloads today, I'd love to know it.

Here's the command I use:

wget --input-file=spider_pages.html --force-html --no-cache --no-check-certificate --recursive --page-requisites --no-parent -e "robots=off" --delete-after --no-directories --no-host-directories --no-verbose
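The visited-link list described above amounts to a seen-set crawl: follow each URL at most once. wget (as of the version discussed here) has no such option, but the idea can be sketched in Python. The `get_links` callback and the toy link graph are hypothetical stand-ins for real page fetching:

```python
from collections import deque

def crawl(start_urls, get_links):
    """Follow links breadth-first, visiting each URL at most once.

    `get_links(url)` is a caller-supplied callback returning the links
    found on `url` (a real spider would fetch and parse the page here).
    """
    seen = set()
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        if url in seen:          # already followed: skip, don't refetch
            continue
        seen.add(url)
        for link in get_links(url):
            if link not in seen:
                queue.append(link)
    return seen

# Toy link graph with duplicate links: "a" is linked from both "a" and "c".
pages = {"a": ["b", "c", "a"], "b": ["c"], "c": ["a"]}
visited = crawl(["a"], lambda u: pages.get(u, []))
print(sorted(visited))  # ['a', 'b', 'c'] -- each page fetched once
```

With a seen set like this, Allan's 250K link traversal would trigger only 70K fetches, one per unique URL.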

Thanks
--Allan

Message: 1
Date: Sun, 27 Dec 2009 13:10:25 -0800
From: Micah Cowan <[email protected]>
Subject: Re: [Bug-wget] Prevent wget from redownloading when using
        recursive option?
To: David <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1

David wrote:
Is there a way to prevent wget from redownloading files it has already
downloaded when using the recursive -r option? I know that -c is used
when downloading a large file, but I wasn't sure whether it could also
be used to accomplish this. It seems that even if it were set not to
download files, it would still have to check that each file had been
completely downloaded. Right now it's hard for me to tell whether this
is its behavior when using -rc, as the individual files are small and
thus do not take long to download (I cannot tell if wget is actually
downloading the full file or just requesting the file's size from the
server and moving on upon seeing that the file is already complete).

I typically use -rc. -rN is also a possibility.
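The -N option suggested here does timestamping: a file is re-fetched only when the server's copy is newer than the local one (wget also re-fetches on a size mismatch, which this sketch omits). A minimal sketch of that decision, assuming the remote modification time is already known as a Unix timestamp; the function name `should_download` is made up for illustration:

```python
import os
import tempfile

def should_download(local_path, remote_mtime):
    """Timestamping check in the spirit of `wget -N` (sketch only):
    fetch if there is no local copy, or if the local copy is older
    than the remote modification time."""
    if not os.path.exists(local_path):
        return True
    return os.path.getmtime(local_path) < remote_mtime

# Demo: pretend a previously downloaded file dates from t=1000.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()
os.utime(tmp.name, (1000, 1000))
print(should_download(tmp.name, 2000))  # True: remote copy is newer
print(should_download(tmp.name, 500))   # False: local copy is up to date
os.unlink(tmp.name)
```

Note that for David's case this still costs one HEAD-style request per file to learn the remote timestamp, but it avoids transferring file bodies that are already up to date.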


