The -noAdditions feature would be ideal for my situation. Hopefully it
will be released soon.
Andrzej Bialecki wrote:
Jacob Brunson wrote:
So the depth number is the number of iterations the recrawl script
will go through. In each iteration, it will select a number of URLs
from the crawl database (generate), download the pages at those URLs
(fetch), and update the crawl database with the URLs that were fetched
as well as any new URLs found (updatedb).
If you want to redownload all your URLs in a single pass, you can set
the depth to 1, the topN value to something around the number of pages
you have in your database, and adddays to 31.
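
To make the generate/fetch/updatedb cycle concrete, here is a toy
in-memory model in Python. It is not Nutch code: the Entry class, the
fake `web` link graph, the function names, and the 30-day refetch
interval are all assumptions made up for illustration. It only sketches
how depth, topN and adddays interact.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    next_fetch: float  # day on which the URL becomes due again (hypothetical field)


def generate(db, now, top_n, add_days=0):
    """Select up to top_n URLs that are due by day now + add_days."""
    due = [url for url, entry in db.items() if entry.next_fetch <= now + add_days]
    return sorted(due)[:top_n]


def fetch(urls, web):
    """Pretend to download each URL; return its outlinks from a fake link graph."""
    return {url: web.get(url, []) for url in urls}


def updatedb(db, fetched, now, interval=30):
    """Record the fetches and add any newly discovered URLs to the database."""
    for url, outlinks in fetched.items():
        db[url].next_fetch = now + interval  # assumed 30-day refetch interval
        for link in outlinks:
            db.setdefault(link, Entry(next_fetch=now))


def recrawl(db, web, now, depth, top_n, add_days):
    """One generate/fetch/updatedb round per unit of depth."""
    for _ in range(depth):
        segment = generate(db, now, top_n, add_days)
        updatedb(db, fetch(segment, web), now)


# Single pass over everything already known: depth 1, topN = database size.
db = {"a": Entry(next_fetch=10), "b": Entry(next_fetch=20)}
web = {"a": ["b", "c"], "b": ["d"]}
recrawl(db, web, now=0, depth=1, top_n=len(db), add_days=31)
print(sorted(db))
```

In this model nothing is due on day 0 without adddays, while adddays=31
makes both known URLs eligible in one pass. After the update step the
database also contains the newly discovered links.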
The problem, though, is how to keep it from adding all the new URLs it
finds during the crawl. You can either write regex URL filters that
match only the pages already indexed, or you could try removing the
updatedb command from the script and see what that does.
Removing the updatedb command will certainly prevent your crawl
database from seeing any new URLs your fetch found, but it also means
the fetch results for your existing URLs are never recorded, which
might have other bad consequences.
In the current trunk/ version updatedb supports an option
-noAdditions. If specified, only initially injected URLs will be
refreshed, and no new URLs will be added during updatedb operations.
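
As a rough sketch of how one recrawl iteration might look with this
option, the Python helper below only builds the three command lines; it
does not run them. The paths (crawl/crawldb, crawl/segments) and the
segment name are hypothetical placeholders, since generate creates the
real segment directory at run time. The flag spellings are the ones
used in this thread, so check the usage output of bin/nutch for your
version before relying on them.

```python
def recrawl_commands(crawldb, segments_dir, segment, top_n, add_days,
                     no_additions=True):
    """Build the command lines for one generate/fetch/updatedb iteration.

    `segment` is a placeholder: in a real script it would be the newest
    directory that generate created under segments_dir.
    """
    generate = ["bin/nutch", "generate", crawldb, segments_dir,
                "-topN", str(top_n), "-adddays", str(add_days)]
    fetch = ["bin/nutch", "fetch", segment]
    updatedb = ["bin/nutch", "updatedb", crawldb, segment]
    if no_additions:
        # Refresh known URLs only; do not add newly discovered ones.
        updatedb.append("-noAdditions")
    return [generate, fetch, updatedb]


# Hypothetical paths and values, for illustration only.
for cmd in recrawl_commands("crawl/crawldb", "crawl/segments",
                            "crawl/segments/SEGMENT",
                            top_n=5000, add_days=31):
    print(" ".join(cmd))
```

Keeping updatedb in the loop means the fetch results are still written
back to the crawl database; only the discovery of new URLs is switched
off.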