Lyndon Maydwell wrote:
Am I right in assuming that broken pages (404) are removed once a page is re-crawled and found missing?
Pages are never removed from the crawldb unless you change the URLFilters to exclude them. Missing pages (404) are marked as GONE. Such pages may be linked to from several sites, and Nutch needs to know that the page has already been discovered and what its fetch status is. If we simply removed them from the db, they would be discovered again, only this time we wouldn't know their status and would have to try fetching them again.
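To illustrate the reasoning, here is a rough sketch of a toy crawl database that keeps a 404 entry instead of deleting it. This is not Nutch code: `ToyCrawlDb`, `Status`, and all method names are hypothetical, any non-404 response is treated as a successful fetch, and re-fetch scheduling is ignored entirely.

```python
from enum import Enum


class Status(Enum):
    UNFETCHED = "unfetched"
    FETCHED = "fetched"
    GONE = "gone"  # the fetch returned 404


class ToyCrawlDb:
    """Hypothetical stand-in for a crawl database; not Nutch's actual API."""

    def __init__(self):
        self.entries = {}  # url -> Status

    def discover(self, url):
        # A newly seen URL starts as UNFETCHED; a known URL keeps its status.
        self.entries.setdefault(url, Status.UNFETCHED)

    def record_fetch(self, url, http_code):
        # Keep the entry even on 404, so the status survives rediscovery.
        # Simplification: every non-404 code counts as a successful fetch.
        self.entries[url] = Status.GONE if http_code == 404 else Status.FETCHED

    def fetch_list(self):
        # Only URLs with no known fetch result are queued.
        return sorted(u for u, s in self.entries.items() if s is Status.UNFETCHED)


db = ToyCrawlDb()
db.discover("http://example.com/missing")
db.record_fetch("http://example.com/missing", 404)
db.discover("http://example.com/missing")  # linked again from another site
print(db.fetch_list())
```

Because the GONE entry is still present when the URL is rediscovered, it does not go back on the fetch list. Had the entry been deleted after the 404, the second `discover` call would have queued it as a brand-new URL.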
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
