Did anyone else hear that "click" sound, or was it just something in my head?
I was missing the info about the 7 day lockout. Now that makes sense. It makes even more sense when I look at machines running jobs with N tasks total, out of which N-1 are completed, and the remaining one is draaaaging. In a system where jobs are submitted sequentially, this means underutilized nodes.

So does the following type of scheduling for Nutch jobs make sense:

0) imagine a cluster with M max maps and R max reduces (say M=R=8)
1) run generate job with -numFetchers equal to M-2
2) run a fetcher job (uses M-2 maps and later all R reduces)
3) at this point there are 2 open map slots for something else to run, say the updatedb job for the previously fetched/parsed segment
4) when updatedb job is done the cluster can take on more jobs. Any completed tasks (C) from the running fetcher job represent "open work slots"
5) start another fetch job. This will be able to use only C tasks, but C will grow as the first job opens up more slots, eventually hitting M-2 open slots.
6) at some point, the fetch job from 2) above will complete, opening up 2 map slots, so updatedb can be run, even in the background, allowing the execution to go back to 1)

Is this all correct? If it is, or when it is, I'll stick it on the Wiki. Without overlapping jobs and getting the procedure right, people running Nutch must not be utilizing their clusters fully. Did I get the numbers (M-2) right?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 17, 2008 4:05:42 AM
Subject: Re: Parallel operations in fetch

[EMAIL PROTECTED] wrote:
> Right. So it sounds like it really makes most sense when one wants
> to limit things, not increase them.
>
> I'm thinking back to the original question of parallelizing /
> overlapping different operations or steps in order to speed up the
> overall process.
> It doesn't sound like there is anything in the
> generate / fetch / parse / updatedb process that one can overlap.

Quite the contrary. Generate updates the CrawlDb so that urls selected for the latest fetchlist become "locked out" for the next 7 days. This means that you can happily generate multiple fetchlists, and fetch them out of order, and then do the DB updates out of order, as you see fit, so long as you make it within the 7 days of the "lock out" period.

This means that it's practical to limit the numFetchers to a number below your cluster capacity, because then you can run other maintenance jobs in parallel with the currently running fetch job (such as updatedb and generate of next fetchlists).

> The only thing I can think of is that it probably pays off to have
> larger fetchlists, as it seems that it takes Generator just as much
> time to generate a larger fetchlist as it does to generate a small
> one. Thus, with a larger fetchlist one at least avoids waiting for
> multiple Generator runs.

The observation about the time is correct, and it makes sense if you think about the way that Generator works. It needs to process all urls in the DB to examine their status, and then select a (presumably small) subset, so that both phases involve the processing of similar amounts of data, no matter what the fetchlist size is (and anyway the second phase is dominated by Hadoop overhead ;) ).

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
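
For anyone who wants to check the arithmetic discussed in this thread, here is a minimal Python sketch of the two bookkeeping rules: how many map slots stay free when numFetchers is kept below the cluster's map capacity, and whether an out-of-order updatedb still falls inside the 7 day lock-out window. The function names and the `LOCKOUT` constant are hypothetical helpers for illustration only and are not part of Nutch or Hadoop. The 7 day value is taken from the message above, and treating the window as measured from the generate time is an assumption.

```python
from datetime import datetime, timedelta

# Assumption: the lock-out window is the 7 days mentioned in the thread,
# counted from the moment the fetchlist was generated.
LOCKOUT = timedelta(days=7)


def spare_map_slots(max_maps: int, num_fetchers: int) -> int:
    """Map slots left for other jobs (e.g. updatedb, generate) while a
    fetch job with num_fetchers map tasks is running."""
    if num_fetchers < 0 or num_fetchers > max_maps:
        raise ValueError("num_fetchers must be between 0 and max_maps")
    return max_maps - num_fetchers


def update_deadline(generated_at: datetime) -> datetime:
    """Latest time by which the DB update for a fetchlist should be done."""
    return generated_at + LOCKOUT


def within_lockout(generated_at: datetime, now: datetime) -> bool:
    """True if an update started at `now` is still inside the window."""
    return now <= update_deadline(generated_at)


if __name__ == "__main__":
    M = 8  # max concurrent map tasks, as in the M=R=8 example
    print(spare_map_slots(M, M - 2))  # slots left for maintenance jobs
    generated = datetime(2008, 4, 10, 4, 0)
    print(within_lockout(generated, datetime(2008, 4, 15, 4, 0)))
```

With M=8 and numFetchers=M-2 this leaves 2 map slots for maintenance jobs, which matches the numbers in the proposed procedure. Whether 2 is the right margin for a given cluster is not something the sketch can answer.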
