[Nutch Wiki] Update of "CrawlDatumStates" by AndrzejBialecki

Apache Wiki Wed, 15 Sep 2010 13:58:36 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "CrawlDatumStates" page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/CrawlDatumStates

--------------------------------------------------

New page:
Note: the information here is specific to Nutch 1.x. Conceptually, the state
machine should be identical in Nutch 2.0, but the implementation details are
different.

Nutch 1.x maintains the state of pages in CrawlDb, which is updated by several
tools:

* Injector - to populate CrawlDb with new URLs
* Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching"
* CrawlDb update - to update the CrawlDb with new knowledge about URLs that
are already in CrawlDb, and to add new URLs discovered from page outlinks.
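
The division of labour between these three tools can be sketched with a toy
in-memory model. This is only an illustration, not Nutch code: the real tools
are Java MapReduce jobs, and the dict layout, the status strings "unfetched"
and "fetched", the "generated" flag, and the function names below are all
assumptions made for the example.

```python
# Toy model of CrawlDb: a dict mapping URL -> per-URL state record.
# All names and status values here are hypothetical, not Nutch's own.

def new_record():
    """State record for a URL that is known but not yet fetched."""
    return {"status": "unfetched", "generated": False}

def inject(crawldb, urls):
    """Injector: populate CrawlDb with new URLs, leaving known ones untouched."""
    for url in urls:
        crawldb.setdefault(url, new_record())

def generate(crawldb, top_n, mark=True):
    """Generator: build a fetchlist and optionally mark its URLs as in progress."""
    fetchlist = [
        url for url, rec in crawldb.items()
        if rec["status"] == "unfetched" and not rec["generated"]
    ][:top_n]
    if mark:
        for url in fetchlist:
            crawldb[url]["generated"] = True
    return fetchlist

def update(crawldb, fetch_results):
    """CrawlDb update: merge fetch results and add URLs found in outlinks.

    fetch_results maps URL -> (new_status, list_of_outlinks).
    """
    for url, (status, outlinks) in fetch_results.items():
        if url in crawldb:
            # New knowledge about an already known URL.
            crawldb[url]["status"] = status
            crawldb[url]["generated"] = False
        for link in outlinks:
            # Newly discovered URLs enter as unfetched; known ones are kept.
            crawldb.setdefault(link, new_record())
```

In this sketch, marking a URL as generated keeps a second generate call from
selecting it again before the update step has run, which is the purpose of the
optional "being in the process of fetching" mark described above.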

Below is a state diagram of CrawlDatum, which is a class that holds this state 
in CrawlDb.
