[Nutch Wiki] Update of "FetchCycleOverlap" by OtisGospodnetic

Apache Wiki Wed, 07 May 2008 11:18:42 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

The comment on the change is:
This won't work 100% correctly - removing it so I don't mislead people

------------------------------------------------------------------------------
- Without overlapping jobs, people running Nutch are likely not utilizing their clusters fully. Thus, here is a recipe for overlapping jobs:
+ deleted
  
- 0. Imagine a cluster with M max maps and R max reduces (say M=R=8).
- 
- 1. Run the generate job with -numFetchers equal to M-2.
- 
- 2. Run a fetcher job (it uses M-2 maps and later all R reduces).
- 
- 3. At this point, while the fetch job is still running, there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment.
- 
- 4. When the updatedb job is done, the cluster can take on more jobs. Any completed tasks (C) from the running fetcher job represent "open work slots".
- 
- 5. Start another fetch job. This will be able to use only C tasks, but C will grow as the first job opens up more slots, eventually hitting M-2 open slots.
- 
- 6. At some point, the fetch job from step 2 will complete, opening up 2 map slots, so updatedb can be run, even in the background, allowing the execution to go back to step 1.
- 
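The slot arithmetic in steps 0 to 6 can be sketched as follows. This is only an illustrative model, not Nutch or Hadoop code: the class `SlotPlan` and its method names are hypothetical, and it assumes, as the recipe does, that each fetch task occupies exactly one map slot. Note that the page author removed this recipe because it does not work 100% correctly, so treat the sketch as a reading aid rather than a scheduling guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotPlan:
    """Hypothetical model of the recipe's map-slot accounting."""

    max_maps: int      # M in the recipe
    num_fetchers: int  # value given to generate via -numFetchers (M-2 in the recipe)

    def __post_init__(self) -> None:
        if not 0 <= self.num_fetchers <= self.max_maps:
            raise ValueError("num_fetchers must be between 0 and max_maps")

    @property
    def reserved_slots(self) -> int:
        # Slots the fetch job never claims; the recipe uses them for updatedb.
        return self.max_maps - self.num_fetchers

    def slots_for_next_fetch(self, completed_tasks: int) -> int:
        # Step 5: a second fetch job can use only the C slots that the
        # first fetch job has already released.
        if not 0 <= completed_tasks <= self.num_fetchers:
            raise ValueError("completed_tasks must be between 0 and num_fetchers")
        return completed_tasks


plan = SlotPlan(max_maps=8, num_fetchers=6)  # M=8, numFetchers=M-2
print(plan.reserved_slots)                   # slots left over for updatedb
print(plan.slots_for_next_fetch(3))          # C=3 tasks of the first fetch are done
```

With M=8 the model leaves 2 slots for maintenance jobs, and the second fetch job's share grows from 0 to M-2 as the first job finishes.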
- Because a URL is "locked out" for 7 days after the generate step includes it in a fetchlist, the above cycle needs to complete within 7 days. In more detail:
- 
- Generate updates the CrawlDb so that URLs selected
- for the latest fetchlist become "locked out" for the next 7 days. This
- means that you can happily generate multiple fetchlists, and fetch them
- out of order, and then do the DB updates out of order, as you see fit,
- so long as you make it within the 7 days of the "lock out" period.
- 
- This means that it's practical to limit numFetchers to a number
- below your cluster capacity, because then you can run other maintenance
- jobs in parallel with the currently running fetch job (such as updatedb
- and generate of next fetchlists).
-
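The 7-day lock-out constraint described in the removed text can be expressed as a simple deadline check. This is a minimal sketch, assuming the lock-out starts when the generate step runs; the names `LOCK_OUT`, `update_deadline`, and `within_lock_out` are hypothetical and not part of Nutch.

```python
from datetime import datetime, timedelta

# Lock-out period stated in the text; assumed here to start at generate time.
LOCK_OUT = timedelta(days=7)


def update_deadline(generated_at: datetime) -> datetime:
    """Latest time by which the updatedb step for this fetchlist should finish."""
    return generated_at + LOCK_OUT


def within_lock_out(generated_at: datetime, updated_at: datetime) -> bool:
    """True if the DB update happens inside the fetchlist's lock-out window."""
    return generated_at <= updated_at <= update_deadline(generated_at)


generated = datetime(2008, 5, 1, 12, 0)
print(update_deadline(generated))                          # end of the window
print(within_lock_out(generated, datetime(2008, 5, 6)))    # 5 days later
print(within_lock_out(generated, datetime(2008, 5, 9)))    # 8 days later
```

Because each fetchlist carries its own window, several fetchlists can be generated, fetched, and updated out of order, provided each one passes this check independently.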
