Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "CrawlDatumStates" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=2&rev2=3

Nutch 1.x maintains the state of pages in CrawlDb, which is updated by various tools:

 * Injector - to populate CrawlDb with new URLs
 * Generator - to generate new fetchlists, and optionally mark those URLs in CrawlDb as "being in the process of fetching"
 * CrawlDb update - to update the CrawlDb with new knowledge about the already known URLs (those already in CrawlDb), as well as to add new URLs discovered from page outlinks

Below is a state diagram of CrawlDatum, the class that holds this state in CrawlDb.

If there was a temporary problem in fetching (e.g. an exception or a timeout), the URL is left as "unfetched" but its retry counter is incremented. If this counter reaches a limit (default is 3), the page is marked as "gone". Pages that are "gone" are not considered for fetching by the Generator for a long time - the maxFetchInterval (e.g. 180 days). The reason for keeping them is that even gone pages may re-appear after a while, and we also want to avoid re-discovering them and giving them a status of "unfetched".

Other possible states after fetching are "truly gone" ;) (e.g. forbidden by robots.txt or unauthorized), which get the same treatment as described above - that is, after a long period of time we check their status again, since it may have changed.

In case of "success" we mark the URL as "fetched". This URL is not eligible for re-fetching until after fetchInterval, at which point it is considered outdated and in need of re-fetching (i.e. the same as "unfetched").
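The transitions described above (temporary failure increments a retry counter, three failures mark the page "gone", a permanent failure marks it "truly gone" immediately, success marks it "fetched") can be sketched as a tiny state machine. This is an illustrative model only, not Nutch's actual CrawlDatum class; the enum names and the RETRY_LIMIT constant are assumptions chosen to mirror the text.

```java
// Simplified model of the CrawlDatum state transitions described above.
// Not Nutch's real API: names and the retry limit of 3 follow the wiki text.
public class CrawlState {
    enum Status { UNFETCHED, FETCHED, GONE }

    static final int RETRY_LIMIT = 3; // default retry limit from the text

    Status status = Status.UNFETCHED;
    int retries = 0;

    /** Apply the outcome of one fetch attempt to this record. */
    void onFetchResult(boolean success, boolean permanentFailure) {
        if (success) {
            status = Status.FETCHED;   // eligible again only after fetchInterval
            retries = 0;
        } else if (permanentFailure) {
            status = Status.GONE;      // "truly gone": robots.txt, unauthorized
        } else {
            retries++;                 // temporary problem: stay "unfetched"
            if (retries >= RETRY_LIMIT) {
                status = Status.GONE;  // give up until maxFetchInterval passes
            }
        }
    }
}
```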
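The scheduling rule above - "fetched" pages become due again after fetchInterval, while "gone" pages are only re-checked after the much longer maxFetchInterval - can be sketched as follows. The 30-day fetchInterval is an assumed example value and the 180-day maxFetchInterval is taken from the example in the text; this is not Nutch's actual Generator or fetch-schedule API.

```java
// Sketch of the re-fetch eligibility rule described above.
// Interval values are illustrative assumptions, not Nutch's configured defaults.
public class RefetchSchedule {
    enum Status { UNFETCHED, FETCHED, GONE }

    static final long DAY_MS = 24L * 60 * 60 * 1000;
    static final long FETCH_INTERVAL_MS = 30 * DAY_MS;      // normal re-fetch cycle (assumed)
    static final long MAX_FETCH_INTERVAL_MS = 180 * DAY_MS; // re-check period for "gone" pages

    /** A page is due when enough time has passed since its last fetch. */
    static boolean isDue(Status status, long lastFetchMs, long nowMs) {
        if (status == Status.UNFETCHED) {
            return true; // never fetched successfully: always eligible
        }
        long interval = (status == Status.GONE) ? MAX_FETCH_INTERVAL_MS : FETCH_INTERVAL_MS;
        // Once the interval elapses the page is treated the same as "unfetched".
        return nowMs - lastFetchMs >= interval;
    }
}
```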

