I'm using a trunk version of Nutch (0.8) and I would like to know the best
way to crawl nightly and update an index.  My problem is that the index step
takes too long when it is given all of the segments.

Right now my typical nightly run looks like this:

repeat several times:
  generate
  fetch
  updatedb

then:
  invertlinks
  index
  dedup
  merge
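
For reference, this is roughly what that looks like as a shell script
against the 0.8 command line (the crawl/* paths, the -topN value, and the
loop count are just placeholders for my setup):

  # Repeat the generate/fetch/updatedb cycle a few times.
  for i in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/2* | tail -1`   # the segment just generated
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done

  # Then build the linkdb and index everything.
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes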

This works, but the index step at the end takes a long time because it
re-indexes every segment.  I would like to index just the most recently
crawled segments and then merge that new index into the main index.  Is
there a recommended way of doing this?  Should I run the index command with
a separate output directory and the usual crawldb argument, passing in just
the most recently crawled segments, and then use merge to combine that with
the indexes built from the older groups of segments?
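
Something along these lines is what I have in mind; the indexes_new and
indexes_old paths are just placeholders, and I have not verified that dedup
and merge accept their arguments exactly this way:

  # Index only tonight's segments into their own directory.
  new_segments=`ls -d crawl/segments/2* | tail -3`   # segments from tonight's run
  bin/nutch invertlinks crawl/linkdb $new_segments
  bin/nutch index crawl/indexes_new crawl/crawldb crawl/linkdb $new_segments

  # Dedup across the new and old indexes, then merge them into the main index.
  # (Not sure this is exactly how dedup/merge want their arguments.)
  bin/nutch dedup crawl/indexes_new crawl/indexes_old
  bin/nutch merge crawl/index crawl/indexes_new crawl/indexes_old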

I would be interested in hearing how people manage nightly crawling like
this, specifically with the 0.8 version.

Thanks.

-- Derek Young
