Lyndon Maydwell wrote:
Am I right in assuming that broken pages (404) are removed once a page
is re-crawled and found missing?


Pages are never removed from the crawldb unless you change the URLFilters to exclude them. Missing pages (404) are marked as GONE. Such pages may be linked to from several sites, and Nutch needs to know that the page has already been discovered and what its fetch status is. If we simply removed them from the db, they would be discovered again, only this time we wouldn't know their status and would have to try fetching them again.
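
To illustrate the reasoning, here is a rough sketch of the idea in Python. It is not Nutch code: ToyCrawlDb, Status and demo are hypothetical names made up for this example, the db is just an in-memory dict, and it ignores any re-fetch scheduling. It only shows why keeping a GONE entry stops a rediscovered link from being queued for fetching again.

```python
from enum import Enum


class Status(Enum):
    UNFETCHED = "unfetched"
    FETCHED = "fetched"
    GONE = "gone"


class ToyCrawlDb:
    """Hypothetical in-memory stand-in for a crawl db (not the Nutch API)."""

    def __init__(self):
        self.status = {}  # url -> Status

    def discover(self, url):
        # A newly seen link is added as UNFETCHED; a known URL keeps its status.
        self.status.setdefault(url, Status.UNFETCHED)

    def record_fetch(self, url, http_code):
        # A 404 marks the entry GONE instead of deleting it.
        self.status[url] = Status.GONE if http_code == 404 else Status.FETCHED

    def fetch_list(self):
        # Only URLs that have never been fetched are scheduled.
        return sorted(u for u, s in self.status.items() if s is Status.UNFETCHED)


def demo():
    db = ToyCrawlDb()
    db.discover("http://example.com/missing")
    db.record_fetch("http://example.com/missing", 404)
    # Another site links to the same missing page.
    db.discover("http://example.com/missing")
    return db.status["http://example.com/missing"], db.fetch_list()
```

In demo() the second discover() leaves the GONE entry untouched, so the URL does not show up in the fetch list. If record_fetch() deleted the entry on a 404 instead, the second discover() would add it back as UNFETCHED and it would be fetched again.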


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
