[Nutch-general] Re: intranet crwl update

Thomas Delnoij Tue, 14 Feb 2006 01:01:00 -0800

I will try to answer your questions. If I am wrong, I am sure one of the
more experienced developers can correct me ...:)


- How do I update/refresh the index? There is no explanation or example
> about the intranet crawl!


The main index (in crawldir/index) is updated by the CrawlTool after every
cycle.

- What is the refresh period of the index? And how can I change it?


The refresh period of the index (if you are using the CrawlTool; otherwise
it depends on how often you merge your indexes by hand) is actually
controlled by the db.default.fetch.interval property, the default number of
days between re-fetches of a page. By default this property is set to 30
days. If you would like to change it, copy the property definition from
nutch-default.xml to nutch-site.xml and change the value there.

- What are the meta-tags nutch uses to decide if a page is new or modified?
> Or is the entire site recrawled with every update?


I don't think Nutch looks at the metatags to decide whether a page should be
refetched or not. The last-modified metatag can be indexed and queried
though; for this to work you need to enable the index-more and query-more
plugins.

- I need to refresh / update the index daily. Is that possible? There are
> every day content updates made by users, which I must


It is certainly possible. I think it mostly depends on how many pages your
site contains and on your network/hardware setup, i.e. whether you can
fetch/parse/index all of the pages in one day. Of course, you have to set
the db.default.fetch.interval property to 1.
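
As a rough sketch, the override in nutch-site.xml could look like the
fragment below. The description text is only my paraphrase, so copy the real
definition from your own nutch-default.xml rather than relying on this one;
the value of 1 (day) is the only actual change. The property goes inside the
existing root element of nutch-site.xml, which is not shown here.

```xml
<!-- Assumed override for a daily refresh; the element layout follows the
     property definitions found in nutch-default.xml. -->
<property>
  <name>db.default.fetch.interval</name>
  <!-- Days between re-fetches of a page (the default is 30). -->
  <value>1</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>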

- If I deploy the nutch war on an application server, can I update/refresh
> the index by a servlet and not using an shell script? We are using an
> windows box and I don't want to install cygwin.


You can do your crawl cycle on a separate box and, when it is done merging
the indexes, copy the crawl dir to the box running the app server.

Can someone send me an step by step explanation or an script that crawl and
> periodicallly refresh / updates the index for one site?


This is what the CrawlTool does. Read the Java code of
org.apache.nutch.tools.CrawlTool and you will get a good idea.

HTH - Thomas
