[Nutch Wiki] Trivial Update of "CrawlDatumStates" by SebastianNagel

Apache Wiki Tue, 06 Dec 2011 13:38:37 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "CrawlDatumStates" page has been changed by SebastianNagel:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=5&rev2=6

Comment:
restored part accidentally deleted in revision #3 (2011-11-21 15:36:01)

  
  If there was a temporary problem in fetching (e.g. an exception or a timeout), 
this URL is left as "unfetched" but its retry counter is incremented. If 
this counter reaches a limit (default is 3), the page is marked as "gone". Pages 
that are "gone" are not considered for fetching by Generator for a long time, 
namely maxFetchInterval (e.g. 180 days). They are kept because even gone pages 
may reappear after a while, and because we want to avoid rediscovering them 
and giving them a status of "unfetched".
  
- Other possible states after fetching are "truly gone" ;) (e.g. forbidden by 
robots.txt or unauthorized), which get the same treatment as described above - 
that is after a long period of time we check again their status, which ma
+ Other possible states after fetching are "truly gone" ;) (e.g. forbidden by 
robots.txt or unauthorized), which get the same treatment as described above - 
that is after a long period of time we check again their status, which may have 
changed.
  
+ In case of "success" we mark this URL as "fetched". This URL is not eligible 
for re-fetching until after fetchInterval, at which point it's considered 
outdated and in need of re-fetching (i.e. the same as "unfetched").
+
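
The transitions described in this diff can be sketched as a small state 
machine. This is only an illustration under stated assumptions, not Nutch's 
actual code: the names Datum, update, is_due and the outcome strings are 
hypothetical, time is counted in whole days, and the 30-day fetch interval is 
an assumed value (the text gives only the retry limit of 3 and the 180-day 
example for maxFetchInterval). Resetting the retry counter on success is also 
an assumption.

```python
from dataclasses import dataclass

RETRY_LIMIT = 3                 # "default is 3" in the text
MAX_FETCH_INTERVAL_DAYS = 180   # "e.g. 180 days" in the text
FETCH_INTERVAL_DAYS = 30        # assumed value, not stated in the text


@dataclass
class Datum:
    """Hypothetical stand-in for one URL's crawl record."""
    status: str = "unfetched"
    retries: int = 0
    next_fetch_day: int = 0


def update(datum: Datum, outcome: str, today: int) -> Datum:
    """Apply one fetch outcome to the record and return it."""
    if outcome == "success":
        datum.status = "fetched"
        datum.retries = 0  # assumption: a success clears the retry counter
        datum.next_fetch_day = today + FETCH_INTERVAL_DAYS
    elif outcome == "temporary_failure":
        # Status is left as it was; only the retry counter grows.
        datum.retries += 1
        if datum.retries >= RETRY_LIMIT:
            datum.status = "gone"
            datum.next_fetch_day = today + MAX_FETCH_INTERVAL_DAYS
    elif outcome == "permanent_failure":
        # e.g. forbidden by robots.txt or unauthorized: same treatment as "gone".
        datum.status = "gone"
        datum.next_fetch_day = today + MAX_FETCH_INTERVAL_DAYS
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    return datum


def is_due(datum: Datum, today: int) -> bool:
    """True once the record is eligible for (re-)fetching."""
    return today >= datum.next_fetch_day
```

In this sketch a "gone" record is not deleted: it simply becomes due again 
after the long interval, which mirrors the reason given above for keeping 
such pages.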
