We set up /etc/resolv.conf as shown below. This checks the local DNS caching server (127.0.0.1) first, then falls back to two of the major public DNS caches on the internet. The 208.x addresses are OpenDNS servers and the 4.x addresses are Verizon DNS servers. All of the servers are open to the public.

search domain.com
nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 208.67.220.220
nameserver 4.2.2.1
nameserver 4.2.2.2
nameserver 4.2.2.3
nameserver 4.2.2.4
nameserver 4.2.2.5
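
One caveat worth noting (not from the original post): the glibc resolver only consults the first MAXNS (normally 3) nameserver lines in resolv.conf, so the entries after 4.2.2.1 above are likely never used. A trimmed sketch, with timeout options whose values are just suggestions:

```
search domain.com
# glibc only honors the first 3 nameserver lines (MAXNS)
nameserver 127.0.0.1
nameserver 208.67.222.222
nameserver 4.2.2.1
# fail over to the next server more quickly on timeout
options timeout:1 attempts:2
```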

Dennis Kubes

Enzo Michelangeli wrote:
----- Original Message -----
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
Sent: Thursday, May 31, 2007 2:25 PM

Are you running jobs in the "local" mode? In distributed mode filtering is naturally parallel, because you have as many concurrent lookups as there are map tasks.

I'm just using the vanilla (local) configuration. The situation is so bad that lately I'm seeing durations like:

generate: 2h 48' (-topN 20000)
fetch:    1h 40' (200 threads)
updatedb: 2h 20'

This is because both generate and updatedb perform filtering, and are single-threaded. Before I enforced filtering on updatedb, that phase lasted only a few minutes. But if I don't filter in updatedb, the database gets polluted by URLs that will never be fetched.

In my experience, using multiple threads for DNS lookup doesn't help that much. What helps A LOT (like several orders of magnitude) is using a local DNS cache, or even two-level DNS cache (one cache per node, one cache per cluster).
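
A two-level setup like the one described could be sketched with dnsmasq (the cluster-cache address below is hypothetical, not from this thread):

```
# /etc/dnsmasq.conf on each crawler node (per-node cache)
listen-address=127.0.0.1
# don't read /etc/resolv.conf for upstream servers
no-resolv
cache-size=10000
# forward cache misses to the cluster-wide cache (address is an example)
server=10.0.0.1
```

Each node's resolv.conf would then point only at 127.0.0.1, and the cluster-level cache would in turn forward to the public resolvers.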

I do have a local cache, but the problem is especially serious with negative responses, which are usually not cached, despite RFC 2308.
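
If the local cache is dnsmasq, one possible mitigation: negative replies that arrive without an SOA-derived TTL are not cached by default, but the neg-ttl option (the value here is an arbitrary example) forces a TTL onto them:

```
# cache negative (NXDOMAIN/NODATA) replies for an hour
# even when the upstream reply carries no SOA record
neg-ttl=3600
```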

Enzo
