2008/4/17, Andrzej Bialecki <[EMAIL PROTECTED]>:
> > The only thing I can think of is that it probably pays off to have
> > larger fetchlists, as it seems that it takes Generator just as much
> > time to generate a larger fetchlist as it does to generate a small
> > one. Thus, with a larger fetchlist one at least avoids waiting for
> > multiple Generator runs.
>
> The observation about the time is correct, and it makes sense if you think
> about the way that Generator works. It needs to process all urls in the DB
> to examine their status, and then select a (presumably small) subset, so
> that both phases involve the processing of similar amounts of data, no
> matter what is the fetchlist size (and anyway the second phase is dominated
> by Hadoop overhead ;) ).

I noticed that readdb has an option [<min>] to skip records with a score
below that value. Perhaps generate could have this option as well? It
would allow the Generator to select a much smaller subset in the select
stage, which should speed up the Generator significantly, at least in
cases where there are many unfetched urls in the database.

--
Best Regards,
Svein Y. Willassen
http://willassen.blogspot.com/
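
To make the idea concrete, here is a rough toy model in Python of the two phases described above: a full scan over the DB followed by a top-N selection, with an optional minimum-score cutoff. This is only a sketch and not Nutch's actual Generator code; the function name, the tuple layout, and the min_score parameter are all made up for illustration.

```python
import heapq
from typing import Iterable, List, Optional, Tuple

# Hypothetical record layout: (url, status, score). Not Nutch's CrawlDatum.
Entry = Tuple[str, str, float]


def generate_fetchlist(
    crawldb: Iterable[Entry],
    top_n: int,
    min_score: Optional[float] = None,
) -> Tuple[List[str], int, int]:
    """Toy two-phase generator.

    Returns (fetchlist, records_scanned, candidates_passed_to_select).
    """
    scanned = 0
    candidates = []

    # Phase 1: every record is examined, whatever top_n is.
    for url, status, score in crawldb:
        scanned += 1
        if status != "unfetched":
            continue
        # Assumed option: drop low-scoring records before the select phase.
        if min_score is not None and score < min_score:
            continue
        candidates.append((score, url))

    # Phase 2: keep only the top_n highest-scoring candidates.
    selected = heapq.nlargest(top_n, candidates)
    return [url for _, url in selected], scanned, len(candidates)


if __name__ == "__main__":
    db = [
        ("http://a/", "unfetched", 0.9),
        ("http://b/", "fetched", 0.8),
        ("http://c/", "unfetched", 0.1),
        ("http://d/", "unfetched", 0.5),
    ]
    print(generate_fetchlist(db, top_n=2))
    print(generate_fetchlist(db, top_n=2, min_score=0.4))
```

In this model the scan still touches every record, so a cutoff would not shorten the first phase. It only reduces how many candidates reach the select phase, which is where I would expect the gain when most of the DB is unfetched.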
