----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Thursday, May 31, 2007 2:25 PM
Are you running jobs in the "local" mode? In distributed mode filtering is
naturally parallel, because you have as many concurrent lookups as there
are map tasks.
I'm just using the vanilla (local) configuration. The situation is so bad
that lately I'm seeing durations like:
generate: 2h 48' (-topN 20000)
fetch: 1h 40' (200 threads)
updatedb: 2h 20'
This is because both generate and updatedb perform filtering, and both are
single-threaded. Before I enforced filtering on updatedb, that phase lasted
only a few minutes. But if I don't filter in updatedb, the database gets
polluted by URLs that will never be fetched.
In my experience, using multiple threads for DNS lookup doesn't help that
much. What helps A LOT (like several orders of magnitude) is using a local
DNS cache, or even a two-level DNS cache (one cache per node, one cache per
cluster).
I do have a local cache, but the problem is especially serious with negative
responses, which are usually not cached (despite RFC 2308).
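
To illustrate what caching negative responses would look like, here is a
minimal in-process sketch in Python. It is not Nutch code: the class name
CachingResolver, its parameters and the TTL values are all made up for
illustration, and it uses fixed TTLs instead of deriving the negative TTL
from the zone's SOA record as RFC 2308 describes.

```python
import socket
import time


class CachingResolver:
    """Hypothetical in-process DNS cache that also remembers failed lookups."""

    def __init__(self, lookup=socket.gethostbyname, positive_ttl=3600.0,
                 negative_ttl=300.0, clock=time.monotonic):
        # TTL values are placeholders, not recommendations.
        self._lookup = lookup
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        self._cache = {}  # host -> (expiry time, address or None)

    def resolve(self, host):
        """Return the address for host, or None if it does not resolve."""
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[0] > now:
            # Cache hit: may be a positive answer or a remembered failure.
            return entry[1]
        try:
            address = self._lookup(host)
            ttl = self._positive_ttl
        except OSError:
            # socket.gaierror is a subclass of OSError; store the failure
            # so the next lookup of this host does not hit the network.
            address = None
            ttl = self._negative_ttl
        self._cache[host] = (now + ttl, address)
        return address
```

With something like this in front of the filter, a host that fails to
resolve costs one real lookup per negative_ttl interval instead of one per
URL.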
Enzo