Hi,
I have a similar issue. I'm using wget recursively as a link-checking
spider. I don't save the downloaded files, so the -c and -N options
won't help me. What I'd love is for wget to keep a list of the links it
has already followed and skip any link on that list. As it is, I
download 250K links, of which only 70K are unique.
I'm thinking this is a feature request, but if there's a way I can cut
down on the extra downloads today, I'd love to know it.
Here's the command I use:
wget --input-file=spider_pages.html --force-html --no-cache \
     --no-check-certificate --recursive --page-requisites --no-parent \
     -e "robots=off" --delete-after --no-directories \
     --no-host-directories --no-verbose
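One way to cut down on the duplicates today, sketched below, is to deduplicate the link list before handing it to wget, so each URL appears in the input file at most once. This is only a workaround sketch, not a wget feature; the file names and example URLs here are hypothetical.

```shell
#!/bin/sh
# Hypothetical link list with duplicates (stand-in for URLs extracted
# from spider_pages.html).
cat > links.txt <<'EOF'
http://example.com/a
http://example.com/b
http://example.com/a
EOF

# Deduplicate the list; wget then sees each URL only once.
sort -u links.txt > links.unique.txt

# Feed the unique list to wget, e.g.:
#   wget --input-file=links.unique.txt --no-verbose --delete-after ...
```

This doesn't stop wget from rediscovering the same URL during recursion, but it does ensure the seed list itself contains no repeats.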
Thanks
--Allan
Message: 1
Date: Sun, 27 Dec 2009 13:10:25 -0800
From: Micah Cowan <[email protected]>
Subject: Re: [Bug-wget] Prevent wget from redownloading when using
recursive option?
To: David <[email protected]>
Cc: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=ISO-8859-1
David wrote:
Is there a way to prevent wget from redownloading files it has already
downloaded when using the recursive -r option? I know that -c is used
when resuming a large download, but I wasn't sure whether it could also
be used to accomplish this. It seems that even if it were set not to
download files, it would still have to check that each file had been
completely downloaded. Right now it's hard for me to tell whether this
is its behavior when using -rc, as the individual files are small and
thus do not take long to download (I cannot tell if wget is actually
downloading the full file or just requesting the file's size from the
server and moving on upon seeing that the file is already complete).
I typically use -rc. -rN is also a possibility.
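The two variants mentioned above can be sketched as follows; this is a CLI usage fragment only, and the URL is a stand-in, not from the thread:

```shell
# -c: resume partial files; files already fully retrieved are not refetched
wget -r -c http://example.com/docs/

# -N: timestamping; refetch a file only if the server's copy is newer
wget -r -N http://example.com/docs/
```

Note that neither option helps when --delete-after is in use, since the local copies needed for the comparison are gone.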