Options to purge db_gone records in updatedb
--------------------------------------------

                 Key: NUTCH-1101
                 URL: https://issues.apache.org/jira/browse/NUTCH-1101
             Project: Nutch
          Issue Type: New Feature
          Components: linkdb
    Affects Versions: 1.4
            Reporter: Markus Jelsma
             Fix For: 1.4


Add an option to updatedb to filter out records with status db_gone (HTTP 404).
This is especially useful when a crawl db is targeted at only one specific
site. If the site, for some reason, suddenly changes a lot of URLs, we'll get
a crawl db filled with garbage. Since the targeted site is known (or
controlled), it is safe to get rid of all these URLs. Doing so reduces the db
size and avoids useless HTTP requests.
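
To illustrate the requested behaviour, here is a minimal sketch in Python. It
is not Nutch code: the CrawlRecord type, the status strings and the
purge_gone function are hypothetical names made up for this example, and the
real option would live in the Java updatedb job. It only shows the idea of
dropping db_gone records when an opt-in flag is set.

```python
from dataclasses import dataclass

# Hypothetical status labels for this sketch, not Nutch's actual constants.
DB_UNFETCHED = "db_unfetched"
DB_FETCHED = "db_fetched"
DB_GONE = "db_gone"


@dataclass(frozen=True)
class CrawlRecord:
    """Simplified stand-in for one crawl db entry."""
    url: str
    status: str


def purge_gone(records, purge=False):
    """Return the records, dropping db_gone ones only when purge is enabled.

    The default keeps everything, so existing crawls behave as before
    unless the option is switched on explicitly.
    """
    if not purge:
        return list(records)
    return [r for r in records if r.status != DB_GONE]


# Example crawl db for a single site whose URLs have changed.
crawl_db = [
    CrawlRecord("http://example.com/", DB_FETCHED),
    CrawlRecord("http://example.com/old-page", DB_GONE),
    CrawlRecord("http://example.com/new-page", DB_UNFETCHED),
]

kept = purge_gone(crawl_db, purge=True)
```

With purge enabled, only the fetched and unfetched records remain, so the
gone URL is neither stored nor requested again. Keeping the flag off by
default matters for broad crawls, where remembering db_gone URLs avoids
rediscovering and refetching them.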

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
