Options to purge db_gone records in updatedb
--------------------------------------------
Key: NUTCH-1101
URL: https://issues.apache.org/jira/browse/NUTCH-1101
Project: Nutch
Issue Type: New Feature
Components: linkdb
Affects Versions: 1.4
Reporter: Markus Jelsma
Fix For: 1.4
Add an option to updatedb to filter out records with status db_gone (HTTP 404).
This is especially useful when a crawl db is targeted at only a
specific site. If the site, for some reason, suddenly changes a lot of its URLs,
we'll get a crawl db filled with garbage. Since the targeted site is known (or
controlled), it is safe to get rid of all these URLs, which reduces the db size
and avoids useless HTTP requests.
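
A minimal sketch of the proposed behaviour, for illustration only. It does not
use Nutch's actual classes: the GonePurgeSketch class, the Status enum, the
purgeGone method and its purge flag are all hypothetical names, and the crawl db
is modelled as a plain in-memory map from URL to status. With the flag on,
db_gone records are dropped during the update; with it off, every record is kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GonePurgeSketch {

    // Hypothetical stand-in for crawl db record statuses (not Nutch's own constants).
    public enum Status { DB_UNFETCHED, DB_FETCHED, DB_GONE }

    // Returns a copy of the crawl db. When purge is true, records whose status
    // is DB_GONE are left out; when false, all records are copied unchanged.
    public static Map<String, Status> purgeGone(Map<String, Status> crawlDb, boolean purge) {
        Map<String, Status> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Status> record : crawlDb.entrySet()) {
            if (purge && record.getValue() == Status.DB_GONE) {
                continue; // drop the record, so it is never scheduled for fetching again
            }
            kept.put(record.getKey(), record.getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example URLs are made up for this sketch.
        Map<String, Status> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://example.com/", Status.DB_FETCHED);
        crawlDb.put("http://example.com/old-page", Status.DB_GONE);
        crawlDb.put("http://example.com/new-page", Status.DB_UNFETCHED);

        System.out.println("purge off: " + purgeGone(crawlDb, false).keySet());
        System.out.println("purge on:  " + purgeGone(crawlDb, true).keySet());
    }
}
```

Keeping the purge behind a flag that defaults to off matters for general crawls,
where a db_gone record still records that a URL was tried and failed.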