I will try to answer your questions. If I am wrong, I am sure one of the more experienced developers can correct me ...:)
> How do I update/refresh the index? There is no explanation or example
> about the intranet crawl!

The main index (in crawldir/index) is updated by the CrawlTool after every cycle.

> What is the refresh period of the index? And how can I change it?

If you are using the CrawlTool, the refresh period is controlled by the db.default.fetch.interval property, which is the default number of days between re-fetches of a page. (Otherwise it depends on how often you merge your indexes by hand.) By default this property is set to 30 days. If you would like to change it, copy the property definition from nutch-default.xml to nutch-site.xml and change the value there.

> What are the meta-tags nutch uses to decide if a page is new or modified?
> Or is the entire site recrawled with every update?

I don't think Nutch looks at the meta tags to decide whether a page should be refetched. The last-modified metatag can be indexed and queried, though; for this to work you need to enable the index-more and query-more plugins.

> I need to refresh / update the index daily. Is that possible? There are
> every day content updates made by users, which I must

It is certainly possible. I think it mostly depends on how many pages your site contains and on your network/hardware setup, i.e. whether you can fetch/parse/index all of the pages in one day. Of course, you have to set the db.default.fetch.interval property to 1.

> If I deploy the nutch war on an application server, can I update/refresh
> the index by a servlet and not using an shell script? We are using an
> windows box and I don't want to install cygwin.

You can run your crawl cycle on a separate box and, once it has finished merging the indexes, copy the crawl dir to the box running the app server.

> Can someone send me an step by step explanation or an script that crawl and
> periodicallly refresh / updates the index for one site?

This is what the CrawlTool does. Read the Java code of org.apache.nutch.tools.CrawlTool and you will get a good idea.

HTH - Thomas
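
P.S. For the daily refresh, the override in nutch-site.xml would look roughly like the sketch below. The property name and the value 1 come from the answer above; the description text is only my wording, and I am assuming the element goes inside whatever root element your nutch-site.xml already has, the same way the definitions sit in nutch-default.xml.

```xml
<!-- Override of the definition copied from nutch-default.xml.
     1 = re-fetch pages after one day instead of the default 30. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>Number of days between re-fetches of a page
  (illustrative wording, not the stock description).</description>
</property>
```

Whether a one-day interval is workable still depends on being able to fetch, parse and index the whole site within a day.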
