Enzo Michelangeli wrote:
> I'm just using the vanilla (local) configuration. The situation is so
> bad that lately I'm seeing durations like:
>
>   generate: 2h 48' (-topN 20000)
>   fetch:    1h 40' (200 threads)
>   updatedb: 2h 20'
>
> This is because both generate and updatedb perform filtering, and both
> are single-threaded. Before I enforced filtering in updatedb, that
> phase lasted only a few minutes. But if I don't filter in updatedb,
> the database gets polluted with URLs that will never be fetched.
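
As an aside, the filtering step itself looks parallelizable. Here is a
minimal sketch, assuming a Nutch-style URLFilter contract where
filter() returns null for a rejected URL; the interface stub, pool
size, and wiring below are illustrative, not the actual Nutch plumbing:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFilter {
        // Stand-in for the Nutch-style filter contract: null = rejected.
        interface UrlFilter {
            String filter(String url);
        }

        static List<String> filterAll(List<String> urls, UrlFilter f,
                                      int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> f.filter(url);
                pending.add(pool.submit(task));
            }
            List<String> accepted = new ArrayList<>();
            for (Future<String> p : pending) {
                String url = p.get();    // blocks until that task is done
                if (url != null) {
                    accepted.add(url);   // the filter passed this URL
                }
            }
            pool.shutdown();
            return accepted;
        }
    }

Filters are normally pure functions of the URL string, so splitting the
list across threads shouldn't change the result, only the wall-clock
time.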

Caching seems to be the only solution. Even if you were able to fire
off DNS requests more rapidly, the remote servers wouldn't be able (or
wouldn't want) to respond that quickly...

>> In my experience, using multiple threads for DNS lookups doesn't
>> help that much. What helps A LOT (like several orders of magnitude)
>> is using a local DNS cache, or even a two-level DNS cache (one cache
>> per node, one cache per cluster).
>
> I do have a local cache, but the problem is especially serious with
> negative responses, which are usually not cached, despite RFC 2308.
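
One workaround, independent of the system resolver, is to keep a small
in-process cache in the fetcher that stores negative answers as well. A
minimal sketch, with illustrative hard-coded TTLs (a real
implementation should honor the SOA-derived negative TTLs that RFC 2308
prescribes):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;

    public class CachingResolver {
        private static final long POSITIVE_TTL_MS = 30 * 60 * 1000; // 30 min
        private static final long NEGATIVE_TTL_MS =  5 * 60 * 1000; //  5 min

        private static class Entry {
            final InetAddress address;  // null marks a negative entry
            final long expiresAt;
            Entry(InetAddress address, long ttlMs) {
                this.address = address;
                this.expiresAt = System.currentTimeMillis() + ttlMs;
            }
        }

        private final ConcurrentHashMap<String, Entry> cache =
            new ConcurrentHashMap<String, Entry>();

        public InetAddress resolve(String host) throws UnknownHostException {
            Entry e = cache.get(host);
            if (e != null && System.currentTimeMillis() < e.expiresAt) {
                if (e.address == null) {
                    // Cached negative answer: fail fast, no network trip.
                    throw new UnknownHostException(host);
                }
                return e.address;
            }
            try {
                InetAddress addr = InetAddress.getByName(host);
                cache.put(host, new Entry(addr, POSITIVE_TTL_MS));
                return addr;
            } catch (UnknownHostException ex) {
                cache.put(host, new Entry(null, NEGATIVE_TTL_MS));
                throw ex;
            }
        }
    }

For what it's worth, the Sun JVM's own InetAddress cache does hold
negative entries, but if I remember correctly only for about 10 seconds
by default (networkaddress.cache.negative.ttl), which is not much help
at crawl scale.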

Which DNS cache implementation are you using? I've had a positive
experience with the djbdns package (its dnscache component), with some
modifications to increase the number of concurrent requests and the
cache size. This was on Linux, though; I have no idea how to do this on
Windows.
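
For reference, the cache-size half of that tuning looks roughly like
this under a typical daemontools layout (paths and values are
illustrative, adjust to your install):

    # Enlarge dnscache's cache; DATALIMIT must stay above CACHESIZE.
    echo 104857600 > /service/dnscache/env/CACHESIZE   # ~100 MB cache
    echo 134217728 > /service/dnscache/env/DATALIMIT   # raise memory limit
    svc -t /service/dnscache                           # restart the service

The concurrent-query limit (MAXUDP, 200 by default) is a compile-time
constant in dnscache.c, so raising it means patching and rebuilding.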
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com