Hi there,
I think I have it more or less thought out, but just in case I missed
something, I would like to check with more experienced people.
I have set up everything to crawl our intranet with Nutch 0.7.
I create the initial index with something like:
bin/nutch crawl $MY_URL_FILE -dir $MY_CRAWL_DIR -depth X -topN Y
Then, periodically (daily?), I maintain that index with either:
.- The "Maintenance Shell Script" from "Nutch - The Java Search Engine -
Nutch Wiki"
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
or
.- The script from "IntranetRecrawl - Nutch Wiki"
http://wiki.apache.org/nutch/IntranetRecrawl
Both seem to be more or less equivalent. After running either of those,
one would restart the web application.
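As I understand it, both wiki scripts boil down to the same generate/fetch/updatedb cycle. A minimal sketch of that cycle, assuming Nutch 0.7 command names and the directory layout that "bin/nutch crawl" creates (the variable defaults and the dry-run mode are my own additions, not from the wiki):

```shell
#!/bin/sh
# Sketch of the generate/fetch/updatedb cycle that both wiki recrawl
# scripts perform. NUTCH defaults to a dry run that only prints the
# commands; point it at a real bin/nutch to actually execute them.
NUTCH=${NUTCH:-"echo bin/nutch"}
CRAWL_DIR=${CRAWL_DIR:-crawl}   # the -dir used for the initial crawl
DEPTH=${DEPTH:-3}               # same meaning as -depth in "nutch crawl"
ADDDAYS=${ADDDAYS:-1}           # shift fetch dates so a daily run refetches

i=1
while [ "$i" -le "$DEPTH" ]; do
  # Select pages due for refetching into a fresh segment
  $NUTCH generate "$CRAWL_DIR/db" "$CRAWL_DIR/segments" -adddays "$ADDDAYS"
  # The newest directory under segments/ is the one just generated
  segment=$(ls -d "$CRAWL_DIR"/segments/* 2>/dev/null | tail -1)
  $NUTCH fetch "$segment"
  $NUTCH updatedb "$CRAWL_DIR/db" "$segment"
  i=$((i + 1))
done
# ...followed by the index/dedup/merge steps from the wiki script, and a
# restart of the web application so it reopens the new index.
```

The -adddays trick is what makes a daily run pick up pages whose normal refetch interval has not yet expired.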
Then, it is apparently recommended to remove the whole $MY_CRAWL_DIR
every now and then (every few months?) and start from scratch. To do so,
one could build the new crawl dir under a different name, then stop the
web application, remove the old directory, rename the new one into place,
and start the web application again.
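In script form, the swap I have in mind would look something like this (the shutdown/startup lines are Tomcat-specific assumptions and left commented out; the demo works on throwaway directories under mktemp so it can be run as-is):

```shell
#!/bin/sh
# Sketch of the periodic "start from scratch" swap: build the new index
# under a different name, stop the webapp, swap directories, restart.
cd "$(mktemp -d)" || exit 1
mkdir crawl crawl.new       # crawl.new stands in for the fresh full recrawl

# 1. Stop the web application so nothing holds the old index open
# "$TOMCAT_HOME"/bin/shutdown.sh

# 2. Swap the old crawl dir for the new one
mv crawl crawl.old
mv crawl.new crawl
rm -rf crawl.old

# 3. Restart the web application against the new index
# "$TOMCAT_HOME"/bin/startup.sh
```

Doing the two mv calls while the webapp is down keeps it from ever seeing a half-removed index.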
Would that be more or less correct? Is there any particular reason to
prefer one maintenance script over the other? I guess the recommended
intervals for the cleaning and recrawling depend on the site, but is
there any rule of thumb for a medium-sized intranet?
In order to pick up the latest news, would you recommend configuring
separate, more frequent recrawls for the "news section" of the web site
(and then making the full recrawl less frequent)?
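One way I could imagine realizing the two frequencies is two separate crawl directories driven by cron (crontab syntax; the recrawl.sh path and the directory names are hypothetical stand-ins for whichever wiki script is used):

```
# min hour dom mon dow  command
30    2    *   *   *    /opt/nutch/recrawl.sh /opt/nutch/crawl-news  # news: daily
0     3    *   *   0    /opt/nutch/recrawl.sh /opt/nutch/crawl-main  # full: weekly
```

The downside would presumably be maintaining two indexes, so I am not sure it is worth it.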
Any advice is welcome,
Thanks,
D.