The "CrawlDatumStates" page has been changed by SebastianNagel:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=5&rev2=6

Comment:
restored part accidentally deleted in revision #3 (2011-11-21 15:36:01)

  If there was a temporary problem in fetching (e.g. exception or time out) then this URL is left as "unfetched" but its retry counter is incremented. If this counter reaches a limit (default is 3) the page is marked as "gone". Pages that are "gone" are not considered for fetching by Generator for a long time, which is the maxFetchInterval (e.g. 180 days) - the reason for keeping them is that even gone pages may re-appear after a while, and also we want to avoid re-discovering them and giving them a status of "unfetched".

- Other possible states after fetching are "truly gone" ;) (e.g. forbidden by robots.txt or unauthorized), which get the same treatment as described above - that is after a long period of time we check again their status, which ma
+ Other possible states after fetching are "truly gone" ;) (e.g. forbidden by robots.txt or unauthorized), which get the same treatment as described above - that is after a long period of time we check again their status, which may have changed.
+ 
+ In case of "success" we mark this URL as "fetched". This URL is not eligible for re-fetching until after fetchInterval, at which point it's considered outdated and in need of re-fetching (i.e. the same as "unfetched").
+ 
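The state transitions described in the restored text can be sketched roughly as follows. This is only an illustrative model, not Nutch's actual code: the names `Datum`, `update` and `is_due` are hypothetical, the 30-day `fetch_interval` is an assumed value (the page only gives the retry limit of 3 and the 180-day example for maxFetchInterval), and resetting the retry counter on success is likewise an assumption.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class Datum:
    """Hypothetical stand-in for a crawl record; not the Nutch CrawlDatum class."""
    status: str = "unfetched"
    retries: int = 0
    next_fetch: float = 0.0  # earliest time the generator may select this URL


def update(datum, outcome, now, *, retry_limit=3,
           fetch_interval=30 * DAY,        # assumed value, not given on the page
           max_fetch_interval=180 * DAY):  # the "e.g. 180 days" from the page
    """Apply one fetch outcome to a datum, following the prose above."""
    if outcome == "success":
        datum.status = "fetched"
        datum.retries = 0  # assumption: a successful fetch clears the counter
        datum.next_fetch = now + fetch_interval
    elif outcome == "temporary_failure":  # e.g. exception or time out
        datum.retries += 1
        if datum.retries >= retry_limit:
            datum.status = "gone"
            datum.next_fetch = now + max_fetch_interval
        else:
            datum.status = "unfetched"  # stays eligible for another attempt
    elif outcome == "denied":  # e.g. forbidden by robots.txt or unauthorized
        datum.status = "gone"
        datum.next_fetch = now + max_fetch_interval
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")
    return datum


def is_due(datum, now):
    """A datum is treated like "unfetched" once its next fetch time has passed."""
    return now >= datum.next_fetch
```

In this model a "gone" or "fetched" record is never deleted; it simply becomes due again once its interval has elapsed, which matches the reason given above for keeping gone pages around.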

