[
https://issues.apache.org/jira/browse/NUTCH-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13094436#comment-13094436
]
Markus Jelsma commented on NUTCH-1101:
--------------------------------------
This has been tested on the production crawl db that actually suffered from this
problem. The outputs of the two readdb -stats jobs add up and make sense. The
db_gone records are purged and new DB_GONE records have been added from the
segment. These will be removed in the next update cycle.
> Options to purge db_gone records in updatedb
> --------------------------------------------
>
> Key: NUTCH-1101
> URL: https://issues.apache.org/jira/browse/NUTCH-1101
> Project: Nutch
> Issue Type: New Feature
> Components: linkdb
> Affects Versions: 1.4
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.4
>
> Attachments: NUTCH-1101-1.4-1.patch
>
>
> Add an option to updatedb to filter out records with status db_gone (HTTP 404).
> This is especially useful in cases where a crawl db is targeted at only a
> specific site. If the site, for some reason, suddenly changes a lot of URLs,
> we'll get a crawl db filled with garbage. Since the targeted site is known
> (or controlled), it is safe to get rid of all these URLs: this reduces the db
> size and avoids useless HTTP requests.
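
To illustrate the behaviour described above, here is a minimal sketch in plain
Java. It is not the code from the attached patch and does not use any Nutch
classes: the class PurgeGoneSketch, the Status enum, the update() method and
the purgeGone flag are all hypothetical names made up for this example. It
only models the idea that an update with purging enabled drops records already
marked db_gone in the existing db, while records that turn out to be gone in
the current segment are still written and only disappear on the next update.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PurgeGoneSketch {

    // Hypothetical status labels, loosely modelled on the names shown by readdb -stats.
    enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE }

    /**
     * Hypothetical model of one update cycle (assumption: url -> status maps
     * stand in for the crawl db and the segment).
     * Existing records with status DB_GONE are skipped when purgeGone is true;
     * all records from the segment are then merged in unchanged.
     */
    static Map<String, Status> update(Map<String, Status> crawlDb,
                                      Map<String, Status> segment,
                                      boolean purgeGone) {
        Map<String, Status> result = new LinkedHashMap<>();
        for (Map.Entry<String, Status> e : crawlDb.entrySet()) {
            if (purgeGone && e.getValue() == Status.DB_GONE) {
                continue; // drop the old gone record
            }
            result.put(e.getKey(), e.getValue());
        }
        // Newly fetched results, including fresh DB_GONE records, are kept for now.
        result.putAll(segment);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Status> db = new LinkedHashMap<>();
        db.put("http://example.com/page", Status.DB_FETCHED);
        db.put("http://example.com/old", Status.DB_GONE);

        Map<String, Status> segment = new LinkedHashMap<>();
        segment.put("http://example.com/moved", Status.DB_GONE);

        // First cycle: the old gone record is purged, the new one is added.
        Map<String, Status> afterFirst = update(db, segment, true);
        System.out.println(afterFirst);

        // Next cycle with an empty segment: the remaining gone record is purged too.
        Map<String, Status> afterSecond = update(afterFirst, new LinkedHashMap<>(), true);
        System.out.println(afterSecond);
    }
}
```

With purgeGone set to false the sketch keeps every record, which corresponds to
an update without the new option.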
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira