The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

New page:
Without overlapping jobs, people running Nutch are likely not fully utilizing
their clusters.  Here is a recipe for overlapping jobs:

0. imagine a cluster with at most M map slots and R reduce slots (say M=R=8)

1. run generate job with -numFetchers equal to M-2

2. run a fetcher job (uses M-2 maps and later all R reduces)

3. at this point there are 2 open map slots for something else to run, say the 
updatedb job for the previously fetched/parsed segment

4. when updatedb job is done the cluster can take on more jobs.  Any completed 
tasks (C) from the running fetcher job represent "open work slots"

5. start another fetch job.  At first it will be able to use only C slots, but
C will grow as tasks of the first job complete, eventually reaching M-2 open
slots.

6. at some point the fetch job from step 2 will complete, opening up 2 map
slots, so updatedb can be run (even in the background) and execution can go
back to step 1
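
The slot arithmetic in the steps above can be sketched in a few lines of
Python.  This is only an illustration of the bookkeeping, not part of Nutch:
the function names and the "reserved" parameter are made up for this example,
and the reserve of 2 map slots is simply the number used in the recipe.

```python
def plan_map_slots(max_maps: int, reserved: int = 2) -> dict:
    """Split the cluster's M map slots between the fetcher and side jobs.

    `reserved` is the number of map slots kept free for jobs such as
    updatedb (2 in the recipe above); the rest is the -numFetchers value.
    """
    if not 0 < reserved < max_maps:
        raise ValueError("reserved must be between 1 and max_maps - 1")
    return {"num_fetchers": max_maps - reserved, "spare_maps": reserved}


def second_fetch_slots(max_maps: int, completed: int, reserved: int = 2) -> int:
    """Map slots the second fetch job can use (step 5).

    `completed` is C, the number of finished map tasks of the first fetch
    job.  The result grows with C and is capped at M - reserved.
    """
    if completed < 0:
        raise ValueError("completed must be non-negative")
    return min(completed, max_maps - reserved)


if __name__ == "__main__":
    M = 8  # example value from step 0
    print(plan_map_slots(M))
    for c in (0, 3, 6):
        print("C =", c, "-> slots for second fetch:", second_fetch_slots(M, c))
```

With M=8 this gives -numFetchers 6 and 2 spare map slots, and the second
fetch job ramps up from 0 to 6 slots as the first job's tasks finish.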

Because a URL is "locked out" for 7 days after the generate step includes it
in a fetchlist, the above cycle needs to complete within 7 days.  In more
detail:

Generate updates the CrawlDb so that URLs selected
for the latest fetchlist become "locked out" for the next 7 days. This
means that you can happily generate multiple fetchlists, and fetch them
out of order, and then do the DB updates out of order, as you see fit,
so long as you make it within the 7 days of the "lock out" period.
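
A minimal sketch of that deadline check, assuming you record the time at
which each fetchlist was generated.  The 7-day figure comes from the text
above; the helper names are hypothetical, and treating the deadline itself
as still inside the window is an assumption of this example.

```python
from datetime import datetime, timedelta

# "Lock out" period applied by generate, as described above.
LOCKOUT = timedelta(days=7)


def lockout_deadline(generated_at: datetime) -> datetime:
    """Latest time by which updatedb for this fetchlist should finish."""
    return generated_at + LOCKOUT


def within_lockout(generated_at: datetime, updated_at: datetime) -> bool:
    """True if the DB update lands inside the segment's lock-out window."""
    return generated_at <= updated_at <= lockout_deadline(generated_at)


if __name__ == "__main__":
    generated = datetime(2007, 1, 1, 12, 0)
    print(lockout_deadline(generated))
    print(within_lockout(generated, datetime(2007, 1, 5)))
    print(within_lockout(generated, datetime(2007, 1, 9)))
```

Each segment carries its own deadline, so segments generated on different
days can be fetched and updated in any order as long as each one passes
this check.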

This means that it's practical to limit numFetchers to a number
below your cluster capacity, because then you can run other maintenance
jobs in parallel with the currently running fetch job (such as updatedb
and generate of next fetchlists).
