Svein Yngvar Willassen wrote:

> I noticed that readdb has an option [<min>] to skip records with score below
> that value. Perhaps it could be an idea to have this option in generate as
> well? This would allow the Generator to select a much smaller subset in the
> select-stage, which should speed up the Generator significantly, at least in
> cases where there are many unfetched urls in the database.

In my experience, the time it takes to execute the second job in Generator is about the same for any fetchlist between 0 and 500k URLs, because the runtime is dominated by Hadoop overhead. So I'm not convinced that this option would bring a significant reduction in time.

Also, keep in mind that if you implement this option, some URLs may never be fetched because their score is too low.
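
To illustrate the point about URLs that are never fetched, here is a toy simulation in plain Python. It is not Nutch code: `select`, `crawl_until_empty`, and the in-memory dict standing in for the crawldb are all made up for this sketch, and it assumes that scores do not change between generate rounds.

```python
from typing import Dict, List, Optional


def select(crawldb: Dict[str, float], top_n: int,
           min_score: Optional[float] = None) -> List[str]:
    """Pick up to top_n unfetched URLs, highest score first.

    If min_score is given, records scoring below it are skipped.
    """
    candidates = [
        (url, score) for url, score in crawldb.items()
        if min_score is None or score >= min_score
    ]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [url for url, _ in candidates[:top_n]]


def crawl_until_empty(crawldb: Dict[str, float], top_n: int,
                      min_score: Optional[float] = None) -> List[str]:
    """Run generate rounds until a round selects nothing.

    Returns the URLs that were never selected (hypothetical model:
    every selected URL is fetched and scores stay fixed).
    """
    remaining = dict(crawldb)
    while True:
        batch = select(remaining, top_n, min_score)
        if not batch:
            return sorted(remaining)
        for url in batch:
            del remaining[url]


if __name__ == "__main__":
    db = {"http://a/": 1.0, "http://b/": 0.5, "http://c/": 0.05}
    print(crawl_until_empty(db, top_n=2))                 # no threshold
    print(crawl_until_empty(db, top_n=2, min_score=0.1))  # with threshold
```

Without a threshold, every URL is eventually selected over successive rounds. With one, anything scoring below it stays in the database indefinitely unless its score later rises.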

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
