The crawl tool is meant to be used only once.

After running the initial crawl you cannot use this tool again.

From that point on you would run:

1. generate
2. updatedb
3. invertlinks
4. index
5. dedup
6. merge
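The steps above (with a fetch between generate and updatedb, which the notes below assume) can be sketched as a shell script. The crawl directory layout, the example segment name, and the bin/nutch path are all assumptions for a typical local install; the script only echoes each command so the cycle can be reviewed before running it for real:

```shell
#!/bin/sh
# Dry-run sketch of one recrawl cycle.
# CRAWL_DIR, the segment name, and the bin/nutch path are assumptions;
# adjust them for your own install before running anything for real.
CRAWL_DIR=crawl
NUTCH=bin/nutch

# Echo each command instead of executing it; drop the echo (or the
# run wrapper) to actually run the cycle.
run() { echo "$@"; }

run $NUTCH generate $CRAWL_DIR/crawldb $CRAWL_DIR/segments
# generate creates a new timestamped segment directory; use the one
# it actually created in place of this example name.
SEGMENT=$CRAWL_DIR/segments/20060112000000
run $NUTCH fetch $SEGMENT
run $NUTCH updatedb $CRAWL_DIR/crawldb $SEGMENT
run $NUTCH invertlinks $CRAWL_DIR/linkdb $SEGMENT
run $NUTCH index $CRAWL_DIR/indexes $CRAWL_DIR/crawldb $CRAWL_DIR/linkdb $SEGMENT
run $NUTCH dedup $CRAWL_DIR/indexes
run $NUTCH merge $CRAWL_DIR/index $CRAWL_DIR/indexes
```

Once the echoed commands look right for your layout, the same script (without the echo) can be dropped into a cron job.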

The default refetch interval for pages is 30 days.

So basically, if the initial crawl covered your whole intranet, you
would run generate again in 30 days.

However, you can run generate with the -adddays parameter set to 30,
and it will produce a fetchlist containing all pages already in your crawldb.

If your system contains new pages, the crawler will find them during the
fetch and update the crawldb.
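If you don't want to wait out the 30-day interval, the -adddays trick looks like this; the crawl directory names are assumptions, and the command is echoed for review rather than executed:

```shell
#!/bin/sh
# -adddays 30 makes generate treat every page as if 30 more days had
# already passed, so the whole crawldb lands in the fetchlist now.
# Directory names are assumptions for a typical local layout.
CMD="bin/nutch generate crawl/crawldb crawl/segments -adddays 30"
echo "$CMD"   # review, then run it for real
```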

G.

On Thu, 2006-01-12 at 07:44 -0500, Andy Morris wrote:
> After doing an initial crawl how do you keep that directory current.
> How often should an intranet crawl be run?  Should this be a cron job and
> do I have to restart tomcat after each crawl?
> 
> Andy
> -----Original Message-----
> From: Tom White [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, January 11, 2006 4:21 AM
> To: [email protected]
> Subject: Introduction to Nutch, Part 1: Crawling
> 
> Hi,
> 
> I've written an article about using Nutch at the intranet scale, which
> you may find interesting:
> http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html
> Please post any comments on the article page itself.
> 
> I've updated the wiki to link to it too.
> 
> Regards,
> 
> Tom
> 
