From what I have gathered, you may want to keep multiple
crawldbs for your crawls. You could have one crawldb for the more frequent crawls and fire off Nutch against that db with the appropriate configs for that job. I was hoping for the same mechanism, but it looks like we need to write this ourselves.
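A rough sketch of what such a two-crawldb setup could look like is below. The directory names, seed files, and conf dirs are my own hypothetical examples, and the use of NUTCH_CONF_DIR to select per-crawl settings is an assumption -- check the bin/nutch script in your Nutch version to confirm it honors that variable. The idea is simply that each crawldb gets its own seed list and its own config overrides (e.g. a much shorter fetch interval in the "frequent" config).

```shell
#!/bin/sh
# Sketch: two independent crawldbs, one for frequently changing pages
# (short refetch interval) and one for everything else.
# All paths and conf dirs below are hypothetical examples.

NUTCH_HOME=/opt/nutch

# conf-frequent/nutch-site.xml would override the default fetch interval
# (the property name varies by Nutch version -- e.g.
# db.default.fetch.interval in older releases); conf-archive/ would keep
# the 30-day default.
run_crawl() {
  db=$1; seeds=$2; conf=$3
  NUTCH_CONF_DIR=$conf
  export NUTCH_CONF_DIR

  "$NUTCH_HOME/bin/nutch" inject   "$db/crawldb" "$seeds"
  "$NUTCH_HOME/bin/nutch" generate "$db/crawldb" "$db/segments"
  # pick up the segment that generate just created
  segment=$db/segments/$(ls "$db/segments" | tail -1)
  "$NUTCH_HOME/bin/nutch" fetch    "$segment"
  "$NUTCH_HOME/bin/nutch" updatedb "$db/crawldb" "$segment"
}

# Run the frequent crawl often (e.g. from cron), the archive crawl rarely.
run_crawl crawl-frequent seeds-frequent.txt "$NUTCH_HOME/conf-frequent"
run_crawl crawl-archive  seeds-archive.txt  "$NUTCH_HOME/conf-archive"
```

Since the two crawldbs never touch each other, you can schedule the frequent job hourly and the archive job monthly without them interfering.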
On 4/12/07, Arie Karhendana <[EMAIL PROTECTED]> wrote:
Hi all,

I'm a new user of Nutch. I use Nutch primarily to crawl blog and news sites, but I noticed that Nutch refetches pages only at some refresh interval (30 days by default).

Blog and news sites have a unique characteristic: some of their pages are updated very frequently (e.g. the main page), so they have to be refetched often, while other pages don't need to be refreshed/refetched at all (e.g. the news article pages, which eventually become 'obsolete').

Is there any way to force-update some URLs? Can I just 're-inject' the URLs to set their next fetch date to 'immediately'?

Thank you,
--
Arie Karhendana
-- "Conscious decisions by conscious minds are what make reality real"
