Did anyone else hear that "click" sound, or was it just something in my head?


I was missing the info about the 7 day lockout.  Now that makes sense.  It 
makes even more sense when I look at machines running jobs with N tasks total, 
out of which N-1 are completed, and the remaining one is draaaaging.  In a 
system where jobs are submitted sequentially, this means underutilized nodes.

So does the following type of scheduling for Nutch jobs make sense:

0) imagine a cluster with M max maps and R max reduces (say M=R=8)

1) run generate job with -numFetchers equal to M-2

2) run a fetcher job (uses M-2 maps and later all R reduces)

3) at this point there are 2 open map slots for something else to run, say the 
updatedb job for the previously fetched/parsed segment

4) when updatedb job is done the cluster can take on more jobs.  Any completed 
tasks (C) from the running fetcher job represent "open work slots"

5) start another fetch job.  This will be able to use only C tasks, but C will 
grow as the first job opens up more slots, eventually hitting M-2 open slots.

6) at some point, the fetch job from 2) above will complete, opening up 2 map 
slots, so updatedb can be run, even in the background, allowing the execution 
to go back to 1)
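
A minimal sketch of the slot arithmetic in steps 0) to 6), not a description of how Hadoop actually schedules tasks. The function name `slot_plan` and its parameters are hypothetical, and it assumes that every completed map task of the first fetch job frees exactly one slot:

```python
def slot_plan(max_maps, reserved, completed):
    """Return (running, open_for_second_fetch, reserved) map-slot counts.

    max_maps  -- M, the cluster's maximum number of concurrent map tasks
    reserved  -- slots kept free for small jobs such as updatedb (2 above)
    completed -- C, map tasks of the first fetch job that have finished
    """
    num_fetchers = max_maps - reserved      # step 1: -numFetchers = M-2
    if not 0 <= completed <= num_fetchers:
        raise ValueError("completed must be between 0 and num_fetchers")
    running = num_fetchers - completed      # first fetch job's live map tasks
    open_for_second_fetch = completed       # step 5: C "open work slots"
    return running, open_for_second_fetch, reserved


if __name__ == "__main__":
    M = 8
    for c in range(0, M - 2 + 1, 2):
        running, open_slots, spare = slot_plan(M, 2, c)
        print(f"C={c}: first fetch uses {running}, "
              f"second fetch may use {open_slots}, reserved {spare}")
```

The three counts always add up to M, which is the point of the scheme: no map slot sits idle while the last task of a fetch job is still dragging.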

Is this all correct?
If it is, or when it is, I'll stick it on the Wiki.  Without overlapping jobs 
and getting the procedure right, people running Nutch must not be utilizing 
their clusters fully.

Did I get the numbers (M-2) right?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Andrzej Bialecki <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 17, 2008 4:05:42 AM
Subject: Re: Parallel operations in fetch

[EMAIL PROTECTED] wrote:

> Right.  So it sounds like it really makes most sense when one wants
> to limit things, not increase them.
> 
> I'm thinking back to the original question of parallelizing /
> overlapping different operations or steps in order to speed up the
> overall process.  It doesn't sound like there is anything in the
> generate / fetch / parse / updatedb process that one can overlap.

Quite the contrary. Generate updates the CrawlDb so that urls selected 
for the latest fetchlist become "locked out" for the next 7 days. This 
means that you can happily generate multiple fetchlists, and fetch them 
out of order, and then do the DB updates out of order, as you see fit, 
so long as you make it within the 7 days of the "lock out" period.
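
To make the timing constraint concrete, here is a small sketch of the check it implies. The helper `within_lockout` is hypothetical and not part of Nutch; the only figure taken from this thread is the 7 day period, and the sketch assumes the clock starts when the fetchlist is generated:

```python
from datetime import datetime, timedelta

# The 7 day "lock out" period described above.
LOCKOUT = timedelta(days=7)


def within_lockout(generated_at, now, lockout=LOCKOUT):
    """True if a fetchlist generated at `generated_at` is still locked out
    at `now`, i.e. its fetch and updatedb can still safely be run."""
    return timedelta(0) <= now - generated_at <= lockout


if __name__ == "__main__":
    generated = datetime(2008, 4, 17, 4, 0)
    print(within_lockout(generated, datetime(2008, 4, 20, 4, 0)))
    print(within_lockout(generated, datetime(2008, 4, 25, 4, 0)))
```

Any segment for which this returns False risks having its urls selected again by a later generate run before its own DB update lands.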

This means that it's practical to limit the numFetchers to a number 
below your cluster capacity, because then you can run other maintenance 
jobs in parallel with the currently running fetch job (such as updatedb 
and generate of next fetchlists).


> The only thing I can think of is that it probably pays off to have
> larger fetchlists, as it seems that it takes Generator just as much
> time to generate a larger fetchlist as it does to generate a small
> one.  Thus, with a larger fetchlist one at least avoids waiting for
> multiple Generator runs.

The observation about the time is correct, and it makes sense if you 
think about the way that Generator works. It needs to process all urls 
in the DB to examine their status, and then select a (presumably small) 
subset, so that both phases involve the processing of similar amounts of 
data, no matter what the fetchlist size is (and anyway the second phase 
is dominated by Hadoop overhead ;) ).

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
