[EMAIL PROTECTED] wrote:
Right. So it sounds like it really makes most sense when one wants
to limit things, not increase them.
I'm thinking back to the original question of parallelizing /
overlapping different operations or steps in order to speed up the
overall process. It doesn't sound like there is anything in the
generate / fetch / parse / updatedb process that one can overlap.
Quite the contrary. Generate updates the CrawlDb so that urls selected
for the latest fetchlist become "locked out" for the next 7 days. This
means that you can happily generate multiple fetchlists, and fetch them
out of order, and then do the DB updates out of order, as you see fit,
so long as you make it within the 7 days of the "lock out" period.
This means that it's practical to limit the numFetchers to a number
below your cluster capacity, because then you can run other maintenance
jobs in parallel with the currently running fetch job (such as updatedb
and generate of next fetchlists).
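To make the "lock out" idea concrete, here is a minimal toy model in Python. It is only a sketch of the behaviour described above, not Nutch's actual code: the dict-based crawldb, the `locked_until` and `fetched` fields, and the function signatures are all assumptions made up for illustration.

```python
from datetime import datetime, timedelta

LOCK_OUT = timedelta(days=7)  # the lock-out period described above

def generate(crawldb, top_n, now):
    """Pick up to top_n eligible URLs and lock them out (toy model)."""
    fetchlist = []
    for url, entry in crawldb.items():
        if len(fetchlist) >= top_n:
            break
        if entry.get("fetched"):
            continue  # already fetched and merged back by updatedb
        locked_until = entry.get("locked_until")
        if locked_until is not None and now < locked_until:
            continue  # sitting in an earlier fetchlist, still locked out
        entry["locked_until"] = now + LOCK_OUT
        fetchlist.append(url)
    return fetchlist

def updatedb(crawldb, fetched_urls):
    """Merge fetch results back; the order of calls does not matter here."""
    for url in fetched_urls:
        crawldb[url]["fetched"] = True
        crawldb[url]["locked_until"] = None

# Hypothetical six-URL database, purely for demonstration.
crawldb = {f"http://example.com/{i}": {} for i in range(6)}
t0 = datetime(2007, 1, 1)

first = generate(crawldb, 2, t0)
second = generate(crawldb, 2, t0 + timedelta(hours=1))  # no overlap with first

updatedb(crawldb, second)  # out of order on purpose
updatedb(crawldb, first)
```

Because the second generate skips everything the first one locked, the two fetchlists are disjoint, and it makes no difference which one is fetched or merged first, as long as that happens before the lock expires.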
The only thing I can think of is that it probably pays off to have
larger fetchlists, as it seems that it takes Generator just as much
time to generate a larger fetchlist as it does to generate a small
one. Thus, with a larger fetchlist one at least avoids waiting for
multiple Generator runs.
The observation about the time is correct, and it makes sense if you
think about the way that Generator works. It needs to process all urls
in the DB to examine their status, and then select a (presumably small)
subset, so both phases process similar amounts of data no matter what
the fetchlist size is (and anyway the second phase is dominated by
Hadoop overhead ;) ).
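A rough way to see why the fetchlist size barely matters is the toy sketch below. It is a single-process Python illustration, not how Generator is really implemented (the real thing is a Hadoop job); the `select_top` function and the score dict are hypothetical names introduced only for this example.

```python
import heapq

def select_top(scores, top_n):
    """Scan every record, then keep the top_n highest-scoring URLs."""
    examined = 0
    candidates = []
    for url, score in scores.items():
        examined += 1  # every record in the DB is read, whatever top_n is
        candidates.append((score, url))
    selected = [url for _, url in heapq.nlargest(top_n, candidates)]
    return examined, selected

# Hypothetical 1000-URL database with made-up scores.
scores = {f"http://example.com/{i}": float(i) for i in range(1000)}

small = select_top(scores, 10)
large = select_top(scores, 500)
```

In both calls the number of records examined is the size of the DB; only the length of the selected list changes.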
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com