Re: Parallelizing URLFiltering

Enzo Michelangeli Thu, 31 May 2007 07:59:34 -0700

----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>

Sent: Thursday, May 31, 2007 2:25 PM

Are you running jobs in the "local" mode? In distributed mode filtering is naturally parallel, because you have as many concurrent lookups as there are map tasks.

I'm just using the vanilla (local) configuration. The situation is so bad that lately I'm seeing durations like:


generate: 2h 48' (-topN 20000)
fetch:    1h 40' (200 threads)
updatedb: 2h 20'

This is because both generate and updatedb perform filtering, and both are single-threaded. Before I enforced filtering on updatedb, that phase lasted only a few minutes. But if I don't filter in updatedb, the database gets polluted by URLs that will never be fetched.
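
To make the idea of parallel filtering concrete, here is a minimal sketch in Python. It is not Nutch code: filter_urls_parallel and host_resolves are hypothetical names, and the DNS-based predicate is only an assumption about what a lookup-bound filter might do. Because such a check spends nearly all its time waiting on the resolver, a thread pool lets many lookups overlap.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List
from urllib.parse import urlsplit


def host_resolves(url: str) -> bool:
    """Hypothetical filter predicate: accept a URL only if its host resolves."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        socket.gethostbyname(host)  # blocking DNS lookup
        return True
    except (OSError, UnicodeError):  # socket.gaierror is a subclass of OSError
        return False


def filter_urls_parallel(
    urls: Iterable[str],
    accept: Callable[[str], bool],
    workers: int = 50,
) -> List[str]:
    """Return the URLs for which accept(url) is true, preserving input order."""
    url_list = list(urls)
    if not url_list:
        return []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs accept() concurrently but yields verdicts in input order.
        verdicts = list(pool.map(accept, url_list))
    return [url for url, ok in zip(url_list, verdicts) if ok]
```

A call such as filter_urls_parallel(candidate_urls, host_resolves, workers=200) would apply the check with up to 200 lookups in flight. Whether this helps in practice depends on the resolver, as the quoted reply below points out.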

In my experience, using multiple threads for DNS lookup doesn't help that much. What helps A LOT (like several orders of magnitude) is using a local DNS cache, or even two-level DNS cache (one cache per node, one cache per cluster).

I do have a local cache, but the problem is especially serious with negative responses, which are usually not cached (despite RFC 2308).
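
As an illustration of negative caching, here is a minimal in-process sketch in Python. CachingResolver and its parameters are hypothetical, and the fixed TTLs are an assumption made for simplicity: RFC 2308 derives the negative TTL from the zone's SOA record, which this sketch does not consult. The point is only that a failed lookup is remembered for a while, so that repeated URLs on the same dead host do not each pay for a full resolver timeout.

```python
import socket
import time
from typing import Callable, Dict, Optional, Tuple


class CachingResolver:
    """Caches both successful and failed lookups, each with its own TTL."""

    def __init__(
        self,
        resolve: Callable[[str], str] = socket.gethostbyname,
        positive_ttl: float = 3600.0,  # assumed value, in seconds
        negative_ttl: float = 300.0,   # assumed value, in seconds
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._positive_ttl = positive_ttl
        self._negative_ttl = negative_ttl
        self._clock = clock
        # host -> (address, or None for a cached failure; expiry time)
        self._cache: Dict[str, Tuple[Optional[str], float]] = {}

    def lookup(self, host: str) -> Optional[str]:
        """Return the address for host, or None if it does not resolve."""
        now = self._clock()
        entry = self._cache.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit, positive or negative
        try:
            address: Optional[str] = self._resolve(host)
            ttl = self._positive_ttl
        except OSError:  # socket.gaierror is a subclass of OSError
            address = None
            ttl = self._negative_ttl
        self._cache[host] = (address, now + ttl)
        return address
```

If several threads share one instance, two of them may occasionally resolve the same host at the same moment; that costs a duplicate lookup but the cached result is the same either way.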


Enzo
