Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "CrawlDatumStates" page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/CrawlDatumStates

--------------------------------------------------

New page:
Note: the information here is specific to Nutch 1.x. Conceptually, the state 
machine should be identical in Nutch 2.0, but the implementation details 
differ.

Nutch 1.x maintains the state of pages in CrawlDb, which is updated by 
several tools:

* Injector - to populate CrawlDb with new URLs
* Generator - to generate new fetchlists, and optionally mark those URLs in 
CrawlDb as "being in the process of fetching"
* CrawlDb update - to update CrawlDb with new information about the URLs it 
already contains, and to add new URLs discovered from page outlinks.
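
The division of labour among the three tools above can be sketched as a toy 
in-memory model. This is only an illustration, not the Nutch API: the class 
CrawlDbSketch, its methods, and the three state names (UNFETCHED, GENERATED, 
FETCHED) are hypothetical and chosen for brevity. The real CrawlDatum class 
defines its own, larger set of status codes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory stand-in for CrawlDb. Hypothetical; not the Nutch API. */
final class CrawlDbSketch {

    /** Hypothetical state names, chosen for illustration only. */
    enum State { UNFETCHED, GENERATED, FETCHED }

    // URL -> state. LinkedHashMap keeps insertion order, so fetchlists are deterministic.
    private final Map<String, State> db = new LinkedHashMap<>();

    /** Injector: populate the db with new URLs, leaving known URLs untouched. */
    void inject(List<String> urls) {
        for (String url : urls) {
            db.putIfAbsent(url, State.UNFETCHED);
        }
    }

    /** Generator: build a fetchlist and optionally mark its URLs as being fetched. */
    List<String> generate(int limit, boolean markGenerated) {
        List<String> fetchlist = new ArrayList<>();
        for (Map.Entry<String, State> entry : db.entrySet()) {
            if (fetchlist.size() >= limit) {
                break;
            }
            if (entry.getValue() == State.UNFETCHED) {
                fetchlist.add(entry.getKey());
                if (markGenerated) {
                    entry.setValue(State.GENERATED);
                }
            }
        }
        return fetchlist;
    }

    /** CrawlDb update: record the fetch of a URL and add newly discovered outlinks. */
    void update(String fetchedUrl, List<String> outlinks) {
        db.put(fetchedUrl, State.FETCHED);
        for (String link : outlinks) {
            // Outlinks that are already known keep their current state.
            db.putIfAbsent(link, State.UNFETCHED);
        }
    }

    /** Returns the state of a URL, or null if the URL is unknown. */
    State stateOf(String url) {
        return db.get(url);
    }

    public static void main(String[] args) {
        CrawlDbSketch db = new CrawlDbSketch();
        db.inject(List.of("http://example.com/"));
        List<String> fetchlist = db.generate(10, true);
        db.update(fetchlist.get(0), List.of("http://example.com/about"));
        System.out.println(db.stateOf("http://example.com/"));
        System.out.println(db.stateOf("http://example.com/about"));
    }
}
```

In this sketch a URL only ever changes state through one of the three 
methods, which mirrors the way CrawlDb is modified only by the tools listed 
above.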

Below is a state diagram of CrawlDatum, the class that holds this state in 
CrawlDb.
