Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "CrawlDatumStates" page has been changed by AndrzejBialecki.
http://wiki.apache.org/nutch/CrawlDatumStates

--------------------------------------------------

New page:
Note: information here is specific to Nutch 1.x - conceptually the state machine should be identical in Nutch 2.0 but implementation details are different.

Nutch 1.x maintains state of pages in CrawlDb, which is updated by various tools:

 * Injector - to populate CrawlDb with new URLs
 * Generator - to generate new fetchlists, and optionally mark those URLs in CrawlDb as "being in the process of fetching"
 * CrawlDb update - to update the CrawlDb with new knowledge about the already known URLs (already in CrawlDb) as well as add new URLs discovered from page outlinks.

Below is a state diagram of CrawlDatum, which is a class that holds this state in CrawlDb.
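
As a rough illustration of how the three tools listed above act on a URL's state, here is a toy model in Python. It is only a sketch and not Nutch code: the `Datum` class, the `generated` flag, the status labels and the function names are all hypothetical, and the real CrawlDatum state machine has more states than the three shown here.

```python
from dataclasses import dataclass

# Hypothetical status labels for this sketch; not Nutch's actual constants.
UNFETCHED = "unfetched"
FETCHED = "fetched"
GONE = "gone"


@dataclass
class Datum:
    """Toy stand-in for CrawlDatum: the per-URL state kept in the CrawlDb."""
    status: str = UNFETCHED
    generated: bool = False  # assumed marker for "being in the process of fetching"


def inject(crawl_db, seed_urls):
    """Injector: populate the CrawlDb with new URLs, leaving known ones alone."""
    for url in seed_urls:
        crawl_db.setdefault(url, Datum())


def generate(crawl_db, top_n, mark=True):
    """Generator: build a fetchlist of unfetched URLs, optionally marking them."""
    fetchlist = []
    for url in sorted(crawl_db):
        if len(fetchlist) >= top_n:
            break
        datum = crawl_db[url]
        if datum.status == UNFETCHED and not datum.generated:
            fetchlist.append(url)
            if mark:
                datum.generated = True
    return fetchlist


def update_db(crawl_db, fetch_results, outlinks):
    """CrawlDb update: record fetch outcomes, then add newly discovered URLs."""
    for url, ok in fetch_results.items():
        datum = crawl_db[url]
        datum.status = FETCHED if ok else GONE
        datum.generated = False  # the URL is no longer in flight
    for url in outlinks:
        crawl_db.setdefault(url, Datum())  # already known URLs keep their state


# One crawl cycle over a two-URL seed list.
db = {}
inject(db, ["http://a.example/", "http://b.example/"])
batch = generate(db, top_n=1)
update_db(db, {batch[0]: True}, ["http://c.example/"])
```

After this single cycle the first seed URL is recorded as fetched, the second is still waiting for a later fetchlist, and the outlink has entered the CrawlDb as a new unfetched entry.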

