Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by OtisGospodnetic:
http://wiki.apache.org/nutch/FetchCycleOverlap

The comment on the change is:
This won't work 100% correctly - removing it so I don't mislead people

------------------------------------------------------------------------------
- Without overlapping jobs people running Nutch are likely not utilizing their clusters fully. Thus, here is a recipe for overlapping jobs:
+ deleted
- 0. imagine a cluster with M max maps and R max reduces (say M=R=8)
-
- 1. run generate job with -numFetchers equal to M-2
-
- 2. run a fetcher job (uses M-2 maps and later all R reduces)
-
- 3. at this point, while the fetch job is still running, there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment
-
- 4. when updatedb job is done the cluster can take on more jobs. Any completed tasks (C) from the running fetcher job represent "open work slots"
-
- 5. start another fetch job. This will be able to use only C tasks, but C will grow as the first job opens up more slots, eventually hitting M-2 open slots.
-
- 6. at some point, the fetch job from 2) above will complete, opening up 2 map slots, so updatedb can be run, even in the background, allowing the execution to go back to 1)
-
- Because a URL is "locked out" for 7 days after the generate step included it into a fetchlist, the above cycle needs to complete within 7 days. In more detail:
-
- Generate updates the CrawlDb so that urls selected
- for the latest fetchlist become "locked out" for the next 7 days. This
- means that you can happily generate multiple fetchlists, and fetch them
- out of order, and then do the DB updates out of order, as you see fit,
- so long as you make it within the 7 days of the "lock out" period.
-
- This means that it's practical to limit the numFetchers to a number
- below your cluster capacity, because then you can run other maintenance
- jobs in parallel with the currently running fetch job (such as updatedb
- and generate of next fetchlists).
-
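
The slot arithmetic and the lock-out window described in the removed text can be sketched as follows. This is only an illustration under stated assumptions, not Nutch code: the function names are hypothetical, the 7-day value is taken from the text above, and treating the exact 7-day boundary as already unlocked is an assumption. The change comment above also notes that the recipe itself does not work 100% correctly.

```python
from datetime import datetime, timedelta

# Lock-out period quoted in the removed text (assumed fixed at 7 days here).
LOCK_OUT = timedelta(days=7)


def spare_map_slots(max_maps: int, num_fetchers: int) -> int:
    """Map slots left free while a fetch job using num_fetchers maps runs."""
    if num_fetchers > max_maps:
        raise ValueError("numFetchers exceeds the cluster's map capacity")
    return max_maps - num_fetchers


def within_lock_out(generated_at: datetime, updated_at: datetime) -> bool:
    """True if the DB update happened while the fetchlist URLs were still locked out."""
    elapsed = updated_at - generated_at
    # Assumption: an update exactly 7 days later counts as too late.
    return timedelta(0) <= elapsed < LOCK_OUT
```

With M=8 and -numFetchers set to M-2, `spare_map_slots(8, 6)` gives the 2 open map slots mentioned in step 3, and `within_lock_out` expresses the condition that a segment's DB update must land inside the lock-out period that started when its fetchlist was generated.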