Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "CrawlDatumStates" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/CrawlDatumStates?action=diff&rev1=2&rev2=3

  
  Nutch 1.x maintains state of pages in CrawlDb, which is updated by various 
tools:
  
- * Injector - to populate CrawlDb with new URLs * Generator - to generate new 
fetchlists, and optionally mark those URLs in CrawlDb as "being in the process 
of fetching" * CrawlDb update - to update the CrawlDb with new knowledge about 
the already known URLs (already in CrawlDb) as well as add new URLs discovered 
from page outlinks.
+  * Injector - to populate CrawlDb with new URLs 
+  * Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching" 
+  * CrawlDb update - to update the CrawlDb with new knowledge about the 
already known URLs (already in CrawlDb) as well as add new URLs discovered from 
page outlinks.
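The way these three tools touch per-URL state can be sketched roughly as follows. This is an illustrative model only: the class, enum, and method names are hypothetical and are not Nutch's actual CrawlDatum/CrawlDb API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the three tools listed above acting on per-URL status.
public class CrawlDbSketch {
    enum Status { UNFETCHED, GENERATED, FETCHED, GONE }

    final Map<String, Status> db = new HashMap<>();

    // Injector: add new URLs as "unfetched"; already-known URLs are left alone.
    void inject(String url) {
        db.putIfAbsent(url, Status.UNFETCHED);
    }

    // Generator: select unfetched URLs for a fetchlist and optionally mark
    // them as "being in the process of fetching".
    void generate(String url) {
        if (db.get(url) == Status.UNFETCHED) db.put(url, Status.GENERATED);
    }

    // CrawlDb update: merge back new knowledge about known URLs and add
    // newly discovered outlinks as fresh "unfetched" entries.
    void update(String url, boolean success, Iterable<String> outlinks) {
        db.put(url, success ? Status.FETCHED : Status.UNFETCHED);
        for (String out : outlinks) inject(out);
    }
}
```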
  
  Below is a state diagram of CrawlDatum, which is a class that holds this 
state in CrawlDb.
  
@@ -25, +27 @@

  
  If there was a temporary problem in fetching (e.g. exception or time out) 
then this URL is left as "unfetched" but its retry counter is incremented. If 
this counter reaches a limit (default is 3) the page is marked as "gone". Pages 
that are "gone" are not considered for fetching by Generator for a long time, 
which is the maxFetchInterval (e.g. 180 days) - the reason for keeping them is 
that even gone pages may re-appear after a while, and also we want to avoid 
re-discovering them and giving them a status of "unfetched".
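The retry-counter transition described above can be sketched like this, using the default limit of 3 mentioned in the text. Field and method names are illustrative, not Nutch's actual CrawlDatum fields.

```java
// Illustrative sketch of the retry-counter behavior described above.
public class RetrySketch {
    static final int RETRY_MAX = 3; // default retry limit from the text

    enum Status { UNFETCHED, GONE }

    static class Datum {
        Status status = Status.UNFETCHED;
        int retries = 0;
    }

    // A temporary fetch failure (e.g. exception or timeout) leaves the URL
    // "unfetched" but increments the counter; once the counter reaches the
    // limit, the page is marked "gone".
    static void onTransientFailure(Datum d) {
        d.retries++;
        if (d.retries >= RETRY_MAX) d.status = Status.GONE;
    }
}
```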
  
- Other possible states after fetching are "truly gone" ;) (e.g. forbidden by 
robots.txt or unauthorized), which get the same treatment as described above - 
that is after a long period of time we check again their status, which may have 
changed.
+ Other possible states after fetching are "truly gone" ;) (e.g. forbidden by 
robots.txt or unauthorized), which get the same treatment as described above - 
that is, after a long period of time we check their status again, which may have 
changed.
  
- In case of "success" we mark this URL as "fetched". This URL is not eligible 
for re-fetching until after fetchInterval, at which point it's considered 
outdated and in need of re-fetching (i.e. the same as "unfetched").
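The re-fetch eligibility described in that paragraph amounts to a simple time check: once fetchInterval has elapsed since the last fetch, a "fetched" URL is treated like "unfetched" again. A minimal sketch, with hypothetical names rather than the real CrawlDatum fields:

```java
// Sketch of fetchInterval-based re-fetch eligibility.
public class RefetchSketch {
    // Returns true when the page is outdated and due for re-fetching,
    // i.e. fetchInterval has elapsed since the last successful fetch.
    static boolean isDue(long nowMillis, long lastFetchMillis, long fetchIntervalMillis) {
        return nowMillis - lastFetchMillis >= fetchIntervalMillis;
    }
}
```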
- 
