Is it safe to run these commands while the searcher (the web interface) is
using the index? In other words, can I just do the following:
1) crawl
2) start tomcat
3) set up a cron job (see the sketch after this list) that runs the following
commands every 5 days (for my intranet I don't want to be up to 30 days
behind): 1. generate, 2. updatedb, 3. invertlinks, 4. index, 5. dedup, 6. merge
4) Sit back and enjoy my eternally up-to-date intranet search engine?
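A minimal sketch of that cron job, assuming a hypothetical wrapper script
(recrawl.sh) that runs the command sequence Gal lists below from the Nutch
installation directory:

  # Hypothetical crontab entry: run the recrawl roughly every 5 days
  # (day-of-month 1, 6, 11, ... at 02:00). The paths and the script name
  # recrawl.sh are assumptions; adjust them for your installation.
  0 2 */5 * * cd /opt/nutch && bin/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1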
Thanks,
Thomas
Gal Nitzan wrote:
The crawl tool can be used only once; after running the initial crawl you
cannot use this tool again.
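For reference, that initial one-shot crawl looks roughly like this (the seed
list, crawl directory, and depth are assumptions; check bin/nutch crawl for
the exact usage of your version):

  # Initial intranet crawl: seed URLs in urls/, everything written under crawl/
  bin/nutch crawl urls -dir crawl -depth 3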
From that point on you would run:
1. generate
2. updatedb
3. invertlinks
4. index
5. dedup
6. merge
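In bin/nutch terms, one pass through those steps might look roughly like the
following sketch; the directory names are assumptions, the exact arguments
differ between Nutch versions, and a fetch of the freshly generated segment
is implied between generate and updatedb:

  # Recrawl pass, assuming the crawl lives under crawl/ (crawldb, segments, linkdb)
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/2* | tail -1`   # newest segment just generated
  bin/nutch fetch $segment                      # implied between generate and updatedb
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch invertlinks crawl/linkdb $segment
  bin/nutch index crawl/newindexes crawl/crawldb crawl/linkdb $segment
  bin/nutch dedup crawl/newindexes
  bin/nutch merge crawl/merged-index crawl/newindexes   # then swap this in for the live index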
The default re-fetch interval for pages is 30 days. So basically, if you
covered your whole intranet in the initial crawl, you would not need to run
generate again until 30 days later.
However, you can run generate with the -adddays parameter set to 30 and it
will produce a fetchlist containing all the pages already in your crawldb.
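For example (the crawldb and segments paths here are assumptions):

  # Treat every page as 30 days older than it is, so pages already in the
  # crawldb become due and generate emits a fetchlist for all of them:
  bin/nutch generate crawl/crawldb crawl/segments -adddays 30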
If your system contains new pages, the crawler will find them during the
fetch and update the crawldb.
G.
On Thu, 2006-01-12 at 07:44 -0500, Andy Morris wrote:
After doing an initial crawl, how do you keep that directory current? How
often should an intranet crawl be run? Should this be a cron job, and do I
have to restart tomcat after each crawl?
Andy
-----Original Message-----
From: Tom White [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 11, 2006 4:21 AM
To: [email protected]
Subject: Introduction to Nutch, Part 1: Crawling
Hi,
I've written an article about using Nutch at the intranet scale, which
you may find interesting:
http://today.java.net/pub/a/today/2006/01/10/introduction-to-nutch-1.html .
Please post any comments on the article page itself.
I've updated the wiki to link to it too.
Regards,
Tom