Andrzej Bialecki wrote:
[EMAIL PROTECTED] wrote:
Hi Enis,


Right, I can easily delete the page from the Lucene index, though I'd
prefer to follow the Nutch protocol and avoid messing something up by
touching the index directly.  However, I don't want that page to
re-appear in one of the subsequent fetches.  Well, it won't
re-appear, because it will remain missing, but it would be great to
be able to tell Nutch to "forget it" everywhere.  Is that
doable? I could read and re-write the *Db Maps, but that's a lot of
IO... just to get a couple of URLs erased.  I'd prefer a friendly
persuasion where Nutch flags a given page as "forget this page as
soon as possible" and it just happens later on.

Somehow you need to flag those pages and keep track of them, so they have to remain in the CrawlDb.

The simplest way to do this is, I think, through the scoring filter API: you can add your own filter which, during the updatedb operation, flags unwanted URLs (by putting a piece of metadata in the CrawlDatum), and then, during the generate step, checks this metadata and returns Float.MIN_VALUE as the generator sort value - which means the page will never be selected for fetching as long as there are other unfetched pages.
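A minimal sketch of what such a filter could look like, assuming the ScoringFilter API roughly as in Nutch 0.9/1.x; the metadata key "_forget_" and the choice to flag pages whose status is STATUS_DB_GONE are just placeholder choices, and only the two relevant methods are shown (the class is left abstract rather than implementing the whole interface):

import java.util.List;

import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilter;
import org.apache.nutch.scoring.ScoringFilterException;

// Sketch only: a real plugin would implement the remaining ScoringFilter
// methods (typically as no-ops) and be registered as a scoring plugin.
public abstract class ForgetPagesScoringFilter implements ScoringFilter {

  // Hypothetical metadata key marking pages that should never be generated again.
  private static final Text FORGET_KEY = new Text("_forget_");

  // Called during updatedb: flag pages we no longer want. Here the policy is
  // "anything marked gone"; assumes CrawlDatum metadata is a Hadoop MapWritable.
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List<CrawlDatum> inlinked) throws ScoringFilterException {
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      datum.getMetaData().put(FORGET_KEY, new FloatWritable(1.0f));
    }
  }

  // Called during generate: push flagged pages to the very bottom of the sort,
  // so they are never selected while other unfetched pages remain.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (datum.getMetaData().containsKey(FORGET_KEY)) {
      return Float.MIN_VALUE;
    }
    // Otherwise fall back to the usual score-based sort value.
    return datum.getScore() * initSort;
  }
}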

You can also modify the Generator to completely skip such flagged pages.

Maybe we should permanently remove the URLs that failed fetching k times from the CrawlDb during the updatedb operation. Since the web is highly dynamic, there can be as many gone sites as new sites (or slightly fewer). As far as I know, once a URL is entered into the CrawlDb it will stay there with one of the possible states: STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED. Am I right?

This way Otis's case will also be resolved.
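To make that concrete, a rough sketch of the kind of check the updatedb reduce step could apply before emitting an entry into the new CrawlDb is below. The class name, the k threshold and the use of getRetriesSinceFetch() as the "failed k times" counter are my assumptions for illustration, not existing Nutch code or configuration.

import org.apache.nutch.crawl.CrawlDatum;

// Sketch only: a purge predicate that an updatedb reducer could consult
// before writing an entry to the new CrawlDb.
public class GonePurgePolicy {

  private final int maxFailures; // the "k" in the discussion above

  public GonePurgePolicy(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  /** Returns true if this entry should be dropped from the new CrawlDb entirely. */
  public boolean shouldPurge(CrawlDatum datum) {
    // STATUS_DB_GONE marks pages whose fetch failed permanently (e.g. 404);
    // treating the retry counter as the failure count is an approximation here.
    return datum.getStatus() == CrawlDatum.STATUS_DB_GONE
        && datum.getRetriesSinceFetch() >= maxFailures;
  }
}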
