Re: Parallel operations in fetch

Andrzej Bialecki Thu, 17 Apr 2008 01:06:32 -0700

[EMAIL PROTECTED] wrote:

Right.  So it sounds like it really makes most sense when one wants
to limit things, not increase them.


I'm thinking back to the original question of parallelizing /
overlapping different operations or steps in order to speed up the
overall process.  It doesn't sound like there is anything in the
generate / fetch / parse / updatedb process that one can overlap.

Quite the contrary. Generate updates the CrawlDb so that urls selected
for the latest fetchlist become "locked out" for the next 7 days. This
means that you can happily generate multiple fetchlists, and fetch them
out of order, and then do the DB updates out of order, as you see fit,
so long as you make it within the 7 days of the "lock out" period.

This means that it's practical to limit the numFetchers to a number
below your cluster capacity, because then you can run other maintenance
jobs in parallel with the currently running fetch job (such as updatedb
and generate of next fetchlists).
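
To make the lock-out idea concrete, here is a toy Python model of it. It
is only a sketch and not Nutch code: the dictionary layout, the function
names and the "fetched" flag are all made up for illustration, and the
real CrawlDb tracks much more state. It only shows why two fetchlists
generated back to back don't overlap, and why the updates can be applied
in any order.

```python
from datetime import datetime, timedelta

LOCKOUT = timedelta(days=7)  # the lock-out period described above


def make_crawldb(n):
    """Hypothetical in-memory stand-in for a CrawlDb with n unfetched urls."""
    return {
        f"http://example.com/{i}": {"locked_until": datetime.min, "fetched": False}
        for i in range(n)
    }


def generate(crawldb, now, top_n):
    """Toy generate: pick up to top_n eligible urls and lock them out."""
    fetchlist = []
    for url in sorted(crawldb):
        entry = crawldb[url]
        # Skip urls that are already fetched or still locked by an earlier run.
        if entry["fetched"] or entry["locked_until"] > now:
            continue
        if len(fetchlist) < top_n:
            entry["locked_until"] = now + LOCKOUT
            fetchlist.append(url)
    return fetchlist


def updatedb(crawldb, fetchlist):
    """Toy updatedb: record that the urls in this fetchlist were fetched."""
    for url in fetchlist:
        crawldb[url]["fetched"] = True


now = datetime(2008, 4, 17)
db = make_crawldb(6)

first = generate(db, now, 2)
second = generate(db, now, 2)   # no overlap: urls in `first` are locked

# Updates applied out of order, which is fine within the lock-out period.
updatedb(db, second)
updatedb(db, first)
```

In this model a url that was generated but never fetched simply becomes
eligible again once its lock expires, which is why the 7 day window
matters.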

The only thing I can think of is that it probably pays off to have
larger fetchlists, as it seems that it takes Generator just as much
time to generate a larger fetchlist as it does to generate a small
one.  Thus, with a larger fetchlist one at least avoids waiting for
multiple Generator runs.

The observation about the time is correct, and it makes sense if you
think about the way that Generator works. It needs to process all urls
in the DB to examine their status, and then select a (presumably small)
subset, so that both phases involve the processing of similar amounts of
data, no matter what is the fetchlist size (and anyway the second phase
is dominated by Hadoop overhead ;) ).
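
A rough way to see this, as a Python sketch rather than anything taken
from the Generator source: selecting the top N entries still means
looking at every record once, so the number of records examined doesn't
depend on N. The scores and the select_top helper below are hypothetical.

```python
def select_top(scores, top_n):
    """Return (top_n urls by score, number of records examined)."""
    examined = 0
    candidates = []
    for url, score in scores.items():
        examined += 1                 # every record is looked at once
        candidates.append((score, url))
    candidates.sort(reverse=True)
    return [url for _, url in candidates[:top_n]], examined


# Hypothetical DB of 1000 urls with made-up scores.
scores = {f"http://example.com/{i}": i % 10 for i in range(1000)}

small, cost_small = select_top(scores, 10)
large, cost_large = select_top(scores, 500)
```

Both calls examine all 1000 records; only the size of the returned list
differs.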


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
