Hi there,
I think I have it more or less thought out, but just in case I missed
something, I would like to check with more experienced people.
I have set up everything to crawl our intranet with Nutch 0.7.
I create the initial index with something like:
bin/nutch crawl $MY_URL_FILE -dir $MY_CRAWL_DIR -depth X -topN Y
Then, periodically (daily?), I maintain that index with either:
.- The "Maintenance Shell Script" from "Nutch - The Java Search Engine -
Nutch Wiki"
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
or
.- The script from "IntranetRecrawl - Nutch Wiki"
http://wiki.apache.org/nutch/IntranetRecrawl
Both seem to be more or less equivalent. After running either of those,
one would restart the web application.
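As I understand it, both wiki scripts boil down to the same generate/fetch/updatedb cycle. A minimal sketch of that cycle, assuming Nutch 0.7 command names and the directory layout that "bin/nutch crawl" creates (the variable defaults and the dry-run mode are my own additions, not from the wiki):

```shell
#!/bin/sh
# Sketch of the generate/fetch/updatedb cycle that both wiki recrawl
# scripts perform. NUTCH defaults to a dry run that only prints the
# commands; point it at a real bin/nutch to actually execute them.
NUTCH=${NUTCH:-"echo bin/nutch"}
CRAWL_DIR=${CRAWL_DIR:-crawl}   # the -dir used for the initial crawl
DEPTH=${DEPTH:-3}               # same meaning as -depth in "nutch crawl"
ADDDAYS=${ADDDAYS:-1}           # shift fetch dates so a daily run refetches

i=1
while [ "$i" -le "$DEPTH" ]; do
  # Select pages due for refetching into a fresh segment
  $NUTCH generate "$CRAWL_DIR/db" "$CRAWL_DIR/segments" -adddays "$ADDDAYS"
  # The newest directory under segments/ is the one just generated
  segment=$(ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -1)
  $NUTCH fetch "$segment"
  $NUTCH updatedb "$CRAWL_DIR/db" "$segment"
  i=$((i + 1))
done
# ...followed by the index/dedup/merge steps from the wiki script, and a
# restart of the web application so it reopens the new index.
```

The -adddays trick is what makes a daily run pick up pages whose normal refetch interval has not yet expired.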
Then, it is apparently recommended to remove the whole $MY_CRAWL_DIR
every now and then (every few months?) and start from scratch. To do so,
one could build the new crawl dir under a different name, then stop the
web application, remove the old directory, rename the new one into place,
and start the web application again.
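In script form, the swap I have in mind would look something like this (the shutdown/startup lines are Tomcat-specific assumptions and left commented out; the demo works on throwaway directories under mktemp so it can be run as-is):

```shell
#!/bin/sh
# Sketch of the periodic "start from scratch" swap: build the new index
# under a different name, stop the webapp, swap directories, restart.
cd "$(mktemp -d)" || exit 1
mkdir crawl crawl.new       # crawl.new stands in for the fresh full recrawl

# 1. Stop the web application so nothing holds the old index open
# "$TOMCAT_HOME"/bin/shutdown.sh

# 2. Swap the old crawl dir for the new one
mv crawl crawl.old
mv crawl.new crawl
rm -rf crawl.old

# 3. Restart the web application against the new index
# "$TOMCAT_HOME"/bin/startup.sh
```

Doing the two mv calls while the webapp is down keeps it from ever seeing a half-removed index.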
Would that be more or less correct? Is there any particular reason to
prefer one maintenance script over the other? I guess the recommended
intervals for the cleaning and recrawling depend on the site, but is
there any rule of thumb for a medium-sized intranet?
In order to pick up the latest news, would you recommend configuring
separate, more frequent recrawls for the "news section" of the web site
(and then making the full recrawl less frequent)?
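One way I could imagine realizing the two frequencies is two separate crawl directories driven by cron (crontab syntax; the recrawl.sh path and the directory names are hypothetical stand-ins for whichever wiki script is used):

```
# min hour dom mon dow  command
30    2    *   *   *    /opt/nutch/recrawl.sh /opt/nutch/crawl-news  # news: daily
0     3    *   *   0    /opt/nutch/recrawl.sh /opt/nutch/crawl-main  # full: weekly
```

The downside would presumably be maintaining two indexes, so I am not sure it is worth it.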
Any advice is welcome,
Thanks,
D.