Re: Recrawl script

Jacob Brunson Tue, 10 Oct 2006 13:25:29 -0700

So the depth number is the number of iterations the recrawl script
will go through.  In each iteration, it will select a number of URLs
from the crawl database (generate), download the pages at those URLs
(fetch), and update the crawl database with the URLs that were fetched
as well as any new URLs found (updatedb).
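
As a rough illustration of that loop, here is a toy model in Python. It is only a sketch, not the recrawl script itself: the function name recrawl, the crawldb set, and the web dictionary are all made up for the example, and real scheduling details are ignored.

```python
def recrawl(crawldb, web, depth, top_n):
    """Toy model of the loop; returns the URLs fetched, in order.

    crawldb: set of known URLs (modified in place)
    web:     dict mapping each URL to the list of URLs it links to
    """
    fetched = []
    for _ in range(depth):
        # generate: select up to top_n known URLs not yet fetched in this run
        segment = sorted(u for u in crawldb if u not in fetched)[:top_n]
        if not segment:
            break  # nothing left to fetch
        # fetch: "download" each page and collect its outlinks
        outlinks = [link for u in segment for link in web.get(u, [])]
        fetched.extend(segment)
        # updatedb: merge newly discovered URLs into the crawl database
        crawldb.update(outlinks)
    return fetched


web = {"a": ["b"], "b": ["c"], "c": []}  # hypothetical link graph

db = {"a"}
print(recrawl(db, web, depth=1, top_n=10), sorted(db))
# prints: ['a'] ['a', 'b']
```

In this model a depth of 0 never enters the loop, so nothing is fetched. A depth of 1 fetches only what was already known but still adds "b" to the database, and a depth of 2 would go on to fetch "b" as well.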


If you want to redownload all your URLs in a single pass, you can set
the depth to 1, the topN value to at least the number of pages
you have in your database, and adddays to 31.
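
To show why an adddays value of 31 should make every page eligible, here is a small sketch. It assumes the refetch interval is 30 days (that number is an assumption on my part, so check db.default.fetch.interval in your configuration) and that generate treats the current time as shifted forward by adddays. The name is_due and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def is_due(last_fetch, now, adddays=0, interval_days=30):
    """Return True if a page fetched at last_fetch would be selected.

    interval_days is an assumed refetch interval, not a value read
    from any real configuration.
    """
    next_fetch = last_fetch + timedelta(days=interval_days)
    # adddays pushes the comparison time into the future
    return next_fetch <= now + timedelta(days=adddays)


now = datetime(2006, 10, 10)
just_fetched = now  # worst case: the page was fetched a moment ago

print(is_due(just_fetched, now))              # prints: False
print(is_due(just_fetched, now, adddays=31))  # prints: True
```

Under those assumptions, even a page fetched a moment ago counts as due once adddays exceeds the interval.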

The problem, though, is how to keep it from adding all the new
URLs it finds during the crawl.  You can either write regex URL
filters that match only the pages already indexed, or you could try
removing the updatedb command from the script and see what that does.
Removing the updatedb command will certainly prevent your crawl
database from seeing any new URLs your fetch found, but it might also
have other bad consequences.  For example, since updatedb is also the
step that records which URLs were fetched, the database would not
know that the existing pages had been refetched.
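
The filter idea could be sketched like this. It is a Python model of first-match include/exclude rules, not actual Nutch configuration; the rule list, the example.com URLs, and the accept function are all made up for illustration.

```python
import re

# Hypothetical rules: "+" accepts, "-" rejects, first match wins.
RULES = [
    ("+", r"^http://intranet\.example\.com/docs/"),  # pages already indexed
    ("-", r"."),                                     # reject everything else
]


def accept(url, rules=RULES):
    """Return True if the first rule matching url is an include rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat as rejected


found = [
    "http://intranet.example.com/docs/a.html",
    "http://other.example.com/new-page.html",
]
kept = [u for u in found if accept(u)]
print(kept)
# prints: ['http://intranet.example.com/docs/a.html']
```

The tighter the include patterns, the fewer newly discovered URLs get through, so the patterns would need to describe the existing pages fairly precisely.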

On 10/10/06, Chris Stephens <[EMAIL PROTECTED]> wrote:

How does the depth option work on the 0.8 recrawl script that is on
http://wiki.apache.org/nutch/IntranetRecrawl ?  I just want to re-index
all of the pages currently in the db and not index any new pages these
pages might link to.  Should I use a 0 for this?  It seems like the
fetcher never runs when I do 0, and if I do anything above zero it
starts indexing at a further depth than what is currently in my crawl
db, which is further than I desire.

-Chris Stephens



--
http://JacobBrunson.com
