Is it safe to run these commands while the searcher (the web interface) is
using the index? In other words, can I just do the following:
1) crawl
2) start tomcat
3) set up a cron job (see the sketch after this list) that runs the following
commands every 5 days (for my intranet I don't want to be up to 30 days
behind): 1. generate, 2. updatedb, 3. invertlinks, 4. index, 5. dedup, 6. merge
4) Sit back and enjoy my eternally up-to-date intranet search engine?
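A minimal sketch of that cron job, assuming a hypothetical wrapper script
(recrawl.sh) that runs the command sequence Gal lists below from the Nutch
installation directory:

  # Hypothetical crontab entry: run the recrawl roughly every 5 days
  # (day-of-month 1, 6, 11, ... at 02:00). The paths and the script name
  # recrawl.sh are assumptions; adjust them for your installation.
  0 2 */5 * * cd /opt/nutch && bin/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1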
Thanks,
Thomas
Gal Nitzan wrote:
The crawl tool can be used only once; after running the initial crawl you
cannot use this tool again.
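For reference, that initial one-shot crawl looks roughly like this (the seed
list, crawl directory, and depth are assumptions; check bin/nutch crawl for
the exact usage of your version):

  # Initial intranet crawl: seed URLs in urls/, everything written under crawl/
  bin/nutch crawl urls -dir crawl -depth 3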
From that point on you would run:
1. generate
2. updatedb
3. invertlinks
4. index
5. dedup
6. merge
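In bin/nutch terms, one pass through those steps might look roughly like the
following sketch; the directory names are assumptions, the exact arguments
differ between Nutch versions, and a fetch of the freshly generated segment
is implied between generate and updatedb:

  # Recrawl pass, assuming the crawl lives under crawl/ (crawldb, segments, linkdb)
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/2* | tail -1`   # newest segment just generated
  bin/nutch fetch $segment                      # implied between generate and updatedb
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb $segment
  bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment
  bin/nutch dedup crawl/newindexes
  bin/nutch merge crawl/merged-index crawl/newindexes   # then swap this in for the live index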
The default re-fetch interval for pages is 30 days. So basically, if you
covered your whole intranet in the initial crawl, you would not need to run
generate again until 30 days later.
However, you can run generate with the -adddays parameter set to 30 and it
will produce a fetchlist containing all the pages already in your crawldb.
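For example (the crawldb and segments paths here are assumptions):

  # Treat every page as 30 days older than it is, so pages already in the
  # crawldb become due and generate emits a fetchlist for all of them:
  bin/nutch generate crawl/crawldb crawl/segments -adddays 30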
If your system contains new pages, the crawler will find them during the
fetch and update the crawldb.
G.
On Thu, 2006-01-12 at 07:44 -0500, Andy Morris wrote:
After doing an initial crawl, how do you keep that directory current? How
often should an intranet crawl be run? Should this be a cron job, and do I
have to restart tomcat after each crawl?
Andy
-----Original Message-----
From: Tom White [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 11, 2006 4:21 AM
To: [email protected]
Subject: Introduction to Nutch, Part 1: Crawling
Hi,
I've written an article about using Nutch at the intranet scale, which
you may find interesting:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html .
Please post any comments on the article page itself.
I've updated the wiki to link to it too.
Regards,
Tom