Stefan - thanks, I'll also find this really helpful as I'm trying to do the same.

I'm curious how this "update" actually works: does the Nutch crawler only re-fetch pages that have changed? If not, is it simply better to run a new Nutch crawl from scratch every day?

Also, is there a way to find out how many URLs you have in your index?
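
(I'm guessing something like

bin/nutch readdb yourDb -stats

might report page and link counts, for the webdb at least, but I haven't verified that.)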

Many thanks,

Dean

----- Original Message -----
From: "Stefan Groschupf" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, November 07, 2005 11:06 PM
Subject: Re: How do I update a nutch db?


Well,
you can use the normal Nutch tools for that, but you may need to set
up the URL filters so that they pass only the pages you want (see the
filter sketch after the commands). Then you can:
// generate a segment
bin/nutch generate yourDb aSegmentFolder
// get the segment
seg=`ls -d aSegmentFolder/2* | tail -1`
// fetch the segment
bin/nutch fetch $seg
// update the webdb with the content of the freshly fetched segment
bin/nutch updatedb yourDb $seg
// index the segment
bin/nutch index $seg
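
For the URL filter, a minimal sketch of the regex filter file might look
like this (the exact file name, e.g. conf/crawl-urlfilter.txt or
conf/regex-urlfilter.txt, depends on your version and the
urlfilter.regex.file property; MY.DOMAIN.NAME is a placeholder):

# accept anything on your own sites
+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
# skip everything else
-.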

Maybe this document gives you a better understanding of the procedure:
http://wiki.media-style.com/display/nutchDocu/Home
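
If you want this to run as a nightly update (as Paul asked), one option
is to put the commands in a small shell script and call it from cron.
An untested sketch, with the paths, yourDb and aSegmentFolder as
placeholders:

#!/bin/sh
# nightly-update.sh - untested sketch, adjust paths for your install
cd /path/to/nutch
# generate a new segment of pages due for fetching
bin/nutch generate yourDb aSegmentFolder
# pick the newest segment directory
seg=`ls -d aSegmentFolder/2* | tail -1`
# fetch the segment
bin/nutch fetch $seg
# update the webdb with the freshly fetched content
bin/nutch updatedb yourDb $seg
# index the segment
bin/nutch index $seg

A crontab line like "0 3 * * * /path/to/nutch/nightly-update.sh" would
then run it every night at 3am.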

HTH
Stefan




On 07.11.2005 at 23:50, Paul M Lieberman wrote:

I've created a db of roughly 250,000 entries from a few of our
websites.  I did this with CrawlTool (depth 10).

How would I go about doing a nightly update to add more pages to
the db?

I have looked high and low through the documentation, and have not
been able to ferret this out.

TIA,

Paul Lieberman
American Psychological Association


---------------------------------------------------------------
company: http://www.media-style.com
forum:   http://www.text-mining.org
blog:    http://www.find23.net



