Andrzej Bialecki wrote:
[EMAIL PROTECTED] wrote:
Hi Enis,


Right, I can easily delete the page from the Lucene index, though I'd
prefer to follow the Nutch protocol and avoid messing something up by
touching the index directly.  However, I don't want that page to
re-appear in one of the subsequent fetches.  Well, it won't
re-appear, because it will remain missing, but it would be great to
be able to tell Nutch to "forget it" everywhere.  Is that
doable? I could read and re-write the *Db Maps, but that's a lot of
IO... just to get a couple of URLs erased.  I'd prefer a friendly
persuasion where Nutch flags a given page as "forget this page as
soon as possible" and it just happens later on.

Somehow you need to flag those pages and keep track of them, so they have to remain in the CrawlDb.

The simplest way to do this is, I think, through the scoring filter API: you can add your own filter which, during the updatedb operation, flags unwanted URLs (by putting a piece of metadata in the CrawlDatum), and then, during the generate step, checks this metadata and returns Float.MIN_VALUE as the generator sort value - which means the page will never be selected for fetching as long as there are other unfetched pages.
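A minimal sketch of what such a filter could look like, assuming the ScoringFilter API roughly as in Nutch 0.9/1.x; the metadata key "_forget_" and the choice to flag pages whose status is STATUS_DB_GONE are just placeholder choices, and only the two relevant methods are shown (the class is left abstract rather than implementing the whole interface):

import java.util.List;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

// Sketch only: a real plugin would implement the remaining ScoringFilter
// methods (typically as no-ops) and be registered as a scoring plugin.
public abstract class ForgetPagesScoringFilter implements ScoringFilter {

  // Hypothetical metadata key marking pages that should never be generated again.
  private static final Text FORGET_KEY = new Text("_forget_");

  // Called during updatedb: flag pages we no longer want. Here the policy is
  // "anything marked gone"; assumes CrawlDatum metadata is a Hadoop MapWritable.
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) throws ScoringFilterException {
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      datum.getMetaData().put(FORGET_KEY, new FloatWritable(1.0f));
    }
  }

  // Called during generate: push flagged pages to the very bottom of the sort,
  // so they are never selected while other unfetched pages remain.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (datum.getMetaData().containsKey(FORGET_KEY)) {
      return Float.MIN_VALUE;
    }
    // Otherwise fall back to the usual score-based sort value.
    return datum.getScore() * initSort;
  }
}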

You can also modify the Generator to completely skip such flagged pages.

Maybe we should permanently remove the URLs that failed fetching k times from the CrawlDb during the updatedb operation. Since the web is highly dynamic, there can be as many gone sites as new sites (or slightly fewer). As far as I know, once a URL is entered into the CrawlDb it will stay there with one of the possible states: STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED. Am I right?

This way Otis's case will also be resolved.
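To make that concrete, a rough sketch of the kind of check the updatedb reduce step could apply before emitting an entry into the new CrawlDb is below. The class name, the k threshold and the use of getRetriesSinceFetch() as the "failed k times" counter are my assumptions for illustration, not existing Nutch code or configuration.

import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: a purge predicate that an updatedb reducer could consult
// before writing an entry to the new CrawlDb.
public class GonePurgePolicy {

  private final int maxFailures; // the "k" in the discussion above

  public GonePurgePolicy(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  /** Returns true if this entry should be dropped from the new CrawlDb entirely. */
  public boolean shouldPurge(CrawlDatum datum) {
    // STATUS_DB_GONE marks pages whose fetch failed permanently (e.g. 404);
    // treating the retry counter as the failure count is an approximation here.
    return datum.getStatus() == CrawlDatum.STATUS_DB_GONE
        && datum.getRetriesSinceFetch() >= maxFailures;
  }
}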
