Very interesting, we are working on a similar issue.
We use a German commercial "Zip Code to Geo Coordinates" DB.
We don't have that many URLs, so performance hasn't been a big problem for us yet, but we are still in development and haven't tested much.
We extract the zip code from the DNS/whois lookup. We do this lookup during indexing, since we index the coordinates as well.
Anyway, we cache all the information in a MySQL database, query that database first, and only fall back to the whois query on a cache miss.
This saves a lot of DNS and whois traffic, though you may not always have the latest information; in our case that is secondary.
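The cache-first pattern described above can be sketched like this (a minimal sketch, with assumptions: an in-memory map stands in for the MySQL table, and `whoisLookup` is a hypothetical stand-in for the real whois query):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-first lookup: consult the local store before doing the
// expensive whois/DNS query. A ConcurrentHashMap stands in here
// for the MySQL table used in the real setup.
class CachingLookup {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> whoisLookup; // the slow remote query

    CachingLookup(Function<String, String> whoisLookup) {
        this.whoisLookup = whoisLookup;
    }

    String zipCodeFor(String host) {
        // computeIfAbsent only calls the remote lookup on a cache miss,
        // so repeated hosts cost one query instead of many.
        return cache.computeIfAbsent(host, whoisLookup);
    }
}
```

The trade-off is exactly the one noted above: cached entries can go stale, which may or may not matter for the application.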
Anyway, I think refactoring the fetch-list tool to a multithreaded style would be useful in other scenarios as well.
Sure, there must be a single writer for the list, but having a multithreaded filter would be good.
I would also love to change the interface-based URLFilter into an extension-point-based filter, since that would allow multiple filters to be installed (location-based and content-based, e.g. all restaurants in NY).
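For illustration, a chain of installed filters could look like this (a sketch only; the URLFilter interface here is simplified from Nutch's net.nutch.net.URLFilter, and URLFilterChain is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for net.nutch.net.URLFilter.
interface URLFilter {
    // Returns the URL if accepted, or null to reject it.
    String filter(String url);
}

// Runs every installed filter in order; a URL must pass all of them.
class URLFilterChain implements URLFilter {
    private final List<URLFilter> filters = new ArrayList<>();

    public void addFilter(URLFilter f) { filters.add(f); }

    public String filter(String url) {
        for (URLFilter f : filters) {
            url = f.filter(url);
            if (url == null) return null; // rejected by this filter
        }
        return url;
    }
}
```

With an extension point, each plugin would contribute one filter to such a chain, so a location-based and a content-based filter could be active at the same time.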
Stefan
On 07.12.2004, at 00:46, Matt Kangas wrote:
Hi folks,
A few weeks ago, I decided to create a Nutch extension that would allow one to crawl URLs only within a certain geographic area. It could be handy for a Canadian to build a Nutch setup that crawls all Canadian sites, including the .coms and .orgs. Or, since I'm in New York, I'd like to search local content in the NYC area without needing the disk space to crawl the entire web.
One way to do this is an IP-to-location lookup, using something like the MaxMind.com GeoIP database. The free version resolves to the country level, and pay versions resolve down to the metro area. So I implemented a subclass of net.nutch.net.RegexURLFilter that does this. (see attached)
The result, IPRegexURLFilter, works as advertised: it filters by regex *and* country-netblock. It's also very, very slow. The reason is quite simple. To do an IP-to-country lookup from a URL, I first have to do a DNS lookup on the hostname, which has high latency. So the single-threaded sections of code that call URLFilter.filter() implementations spend most of their time waiting for the lookup to complete.
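The attached class isn't reproduced here, but the expensive step looks roughly like this (a sketch under assumptions: countryFor is a hypothetical stand-in for the GeoIP country lookup, and error handling is collapsed):

```java
import java.net.InetAddress;
import java.net.URI;

class GeoFilterSketch {
    // Resolves the URL's hostname and returns a country code, or null on
    // failure. The DNS lookup inside getByName() is the blocking call that
    // dominates the filter's running time when called single-threaded.
    static String countryOf(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            InetAddress addr = InetAddress.getByName(host); // blocking DNS lookup
            return countryFor(addr); // GeoIP database query (fast, local)
        } catch (Exception e) {
            return null;
        }
    }

    // Hypothetical stand-in for MaxMind's GeoIP country lookup.
    static String countryFor(InetAddress addr) {
        return addr.isLoopbackAddress() ? "LOCAL" : "??";
    }
}
```

The regex part of the filter is cheap; it's the per-URL DNS round trip that stalls the single-threaded caller.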
My instincts tell me there are two ways to improve this situation:
1) Move the IP-based filter into the multithreaded parts of Fetcher, e.g. FetcherThread
2) Or, push it all the way down to where the Fetcher does its own DNS lookup, so we eliminate the duplicate lookup for each non-filtered URL
(2) would require hooking into each Protocol implementation that deals with hostnames, e.g. protocol-http AND protocol-ftp. That seems like a bad idea. Considering that the JVM will cache DNS requests, perhaps it's not worth going this far to eliminate the double-lookup.
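For reference, the JVM's DNS cache lifetime can be tuned through standard security properties (this is plain JDK behavior, not Nutch-specific; the class name below is hypothetical):

```java
import java.security.Security;

class DnsCacheConfig {
    static void configure() {
        // Cache successful lookups for an hour; the default lifetime
        // is implementation-specific and depends on the security manager.
        Security.setProperty("networkaddress.cache.ttl", "3600");
        // Don't remember failed lookups for long.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }

    static String ttl() {
        return Security.getProperty("networkaddress.cache.ttl");
    }
}
```

With a generous TTL, the filter's lookup and the Fetcher's later lookup for the same host should mostly hit the cache, which supports the "maybe the double lookup isn't worth eliminating" reasoning above.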
So, if (1) is a better course of action, I would need to hook into FetcherThread.run() and run a filter before the call to protocol.getContent(url).
What's the best way to achieve this? More importantly, what's the Nutch way? Since FetcherThread is an inner class, subclassing it isn't the answer. A delegate of some kind seems more appropriate. Perhaps Fetcher could gain a URLFilter ivar which, if not null, FetcherThread calls before protocol.getContent(url)?
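The delegate idea might be sketched like this (hypothetical names throughout; the real FetcherThread is an inner class of net.nutch.fetcher.Fetcher, and protocol.getContent(url) is reduced to a placeholder comment):

```java
// Sketch of the proposed hook: Fetcher holds an optional URLFilter,
// and each worker thread consults it before fetching any content.
class FetcherSketch {
    interface URLFilter {
        String filter(String url); // returns url if accepted, null if not
    }

    private URLFilter urlFilter; // null means "no extra filtering"

    void setUrlFilter(URLFilter f) { this.urlFilter = f; }

    // Stand-in for the per-URL work inside FetcherThread.run().
    boolean fetch(String url) {
        if (urlFilter != null && urlFilter.filter(url) == null) {
            return false; // filtered out before any network I/O
        }
        // ... protocol.getContent(url) would happen here ...
        return true;
    }
}
```

Because the filter runs inside the already-multithreaded fetch loop, the DNS latency of an IP-based filter overlaps with other threads' work instead of serializing the whole pipeline.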
I think this would be a generally useful extension to the crawler, and I'm willing to write it and submit it as a patch.
Nutch committers, what do you think?
(ps: I don't work for MaxMind, I just think their product is useful. The DB access API and GeoIP Free DB are both GPL'd)
--Matt <ipregexurlfilter.java>
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
