Re: Parallel operations in fetch

Svein Yngvar Willassen Thu, 17 Apr 2008 01:19:26 -0700

2008/4/17, Andrzej Bialecki <[EMAIL PROTECTED]>:
>
>
> > The only thing I can think of is that it probably pays off to have
> > larger fetchlists, as it seems that it takes Generator just as much
> > time to generate a larger fetchlist as it does to generate a small
> > one.  Thus, with a larger fetchlist one at least avoids waiting for
> > multiple Generator runs.
> >
>
> The observation about the time is correct, and it makes sense if you think
> about the way that Generator works. It needs to process all urls in the DB
> to examine their status, and then select a (presumably small) subset, so
> that both phases involve the processing of similar amounts of data, no
> matter what is the fetchlist size (and anyway the second phase is dominated
> by Hadoop overhead ;) ).
>



I noticed that readdb has an option [<min>] to skip records with a score below
that value. Perhaps it would be a good idea to have this option in generate as
well? This would allow the Generator to select a much smaller subset in the
select stage, which should speed up the Generator significantly, at least in
cases where there are many unfetched urls in the database.
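
To illustrate what I mean, here is a rough sketch in Python. It is not the
actual Generator code (which runs as Hadoop jobs); the function name, the
record layout and the min_score parameter are all made up for illustration.
The point is only that every record is still scanned once, but records below
the threshold are dropped before the more expensive select/sort step.

```python
import heapq


def select_fetchlist(records, top_n, min_score=None):
    """Pick up to top_n unfetched urls, highest score first.

    records   -- iterable of (url, status, score) tuples (hypothetical layout)
    min_score -- hypothetical threshold; records scoring below it are skipped
    """
    # Phase 1: scan every record in the db and keep only the candidates.
    # With a threshold, low-scoring urls are discarded here already.
    candidates = (
        (score, url)
        for url, status, score in records
        if status == "unfetched" and (min_score is None or score >= min_score)
    )
    # Phase 2: select the top_n candidates by score.
    return [url for score, url in heapq.nlargest(top_n, candidates)]


db = [
    ("http://a.example/", "unfetched", 0.9),
    ("http://b.example/", "fetched", 2.0),
    ("http://c.example/", "unfetched", 0.1),
    ("http://d.example/", "unfetched", 0.5),
]

# Without a threshold all three unfetched urls reach phase 2;
# with min_score=0.4 only two of them do.
print(select_fetchlist(db, 10))
print(select_fetchlist(db, 10, min_score=0.4))
```

Whether this would help much in practice depends on how much of the
Generator's time is really spent in the select stage as opposed to the scan
and the Hadoop overhead.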


-- 
Best Regards,

Svein Y. Willassen
http://willassen.blogspot.com/
