I'm using a trunk version of Nutch (0.8) and I would like to know the best way to crawl nightly and update an index. My problem is that the index step takes too long when given all segments.
Right now my typical nightly run looks like this:

repeat several times:
  generate
  fetch
  updatedb

then:
  invertlinks
  index
  dedup
  merge

This works, but the index step at the end takes a long time. I would like to index just the most recently crawled pages, then merge them into the main index. Is there a recommended way of doing this? Should I point the index and crawldb arguments of the index command at a separate directory, passing in just the most recently crawled segments, and then use merge to combine the result with the indexes of older groups of segments?

I would be interested in hearing how people manage nightly crawling like this, specifically with the 0.8 version. Thanks.

-- Derek Young
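For what it's worth, the incremental approach described above might be sketched roughly like this with the 0.8 command names (index, dedup, merge); all the paths here are placeholders, and the exact arguments should be checked against your Nutch build:

```sh
# Hypothetical sketch, assuming the 0.8 CLI:
#   index = Indexer, dedup = DeleteDuplicates, merge = IndexMerger.
# NEW_SEG is a placeholder for tonight's freshly fetched segment(s).
NEW_SEG=crawl/segments/20060101000000

# Index only the new segment(s) into a separate directory
bin/nutch index crawl/indexes-new crawl/crawldb crawl/linkdb $NEW_SEG

# Remove duplicate documents from the new index
bin/nutch dedup crawl/indexes-new

# Merge the new index with the existing main index into a fresh directory
bin/nutch merge crawl/index-merged crawl/index crawl/indexes-new

# Then swap the merged index in for the live one, e.g.:
#   mv crawl/index crawl/index.old && mv crawl/index-merged crawl/index
```

This avoids re-indexing every segment each night, at the cost of an extra merge and index swap step.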
