Very interesting, we are working on a similar issue.
We use a German commercial "Zip Code to Geo Coordinates" DB.
We don't have that many URLs, so performance hasn't been a big problem for us yet, but we are still in development and haven't tested much.
We extract the zip code from the DNS/whois lookup. We do this lookup during indexing, since we index the coordinates as well.
Anyway, we cache all the information in a MySQL database, query that database first, and only fall back to the whois query on a cache miss.
This saves a lot of DNS and whois traffic, though you may not always have the latest information; in our case that is secondary.
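The cache-first pattern described above can be sketched like this (a minimal sketch, with assumptions: an in-memory map stands in for the MySQL table, and `whoisLookup` is a hypothetical stand-in for the real whois query):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-first lookup: consult the local store before doing the
// expensive whois/DNS query. A ConcurrentHashMap stands in here
// for the MySQL table used in the real setup.
class CachingLookup {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> whoisLookup; // the slow remote query

    CachingLookup(Function<String, String> whoisLookup) {
        this.whoisLookup = whoisLookup;
    }

    String zipCodeFor(String host) {
        // computeIfAbsent only calls the remote lookup on a cache miss,
        // so repeated hosts cost one query instead of many.
        return cache.computeIfAbsent(host, whoisLookup);
    }
}
```

The trade-off is exactly the one noted above: cached entries can go stale, which may or may not matter for the application.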
Anyway, I think refactoring the fetch-list tool to a multithreaded style would be useful in other scenarios as well.
Sure, there must be a single writer for the list, but having a multithreaded filter would be good.
I would also love to change the interface-based URLFilter into an extension-point-based filter, since that would allow multiple filters to be installed (location-based and content-based, e.g. all restaurants in NY).
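For illustration, a chain of installed filters could look like this (a sketch only; the URLFilter interface here is simplified from Nutch's net.nutch.net.URLFilter, and URLFilterChain is a hypothetical name):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for net.nutch.net.URLFilter.
interface URLFilter {
    // Returns the URL if accepted, or null to reject it.
    String filter(String url);
}

// Runs every installed filter in order; a URL must pass all of them.
class URLFilterChain implements URLFilter {
    private final List<URLFilter> filters = new ArrayList<>();

    public void addFilter(URLFilter f) { filters.add(f); }

    public String filter(String url) {
        for (URLFilter f : filters) {
            url = f.filter(url);
            if (url == null) return null; // rejected by this filter
        }
        return url;
    }
}
```

With an extension point, each plugin would contribute one filter to such a chain, so a location-based and a content-based filter could be active at the same time.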
Stefan
On 07.12.2004, at 00:46, Matt Kangas wrote:
Hi folks,
A few weeks ago, I decided to create a Nutch extension that would allow one to crawl URLs only within a certain geographic area. It could be handy for a Canadian to build a Nutch setup that crawls all Canadian sites, including the .coms and .orgs. Or, since I'm in New York, I'd like to search local content in the NYC area without needing the disk space to crawl the entire web.
One way to do this is an IP-to-location lookup, using something like the MaxMind.com GeoIP database. The free version resolves to the country level, and pay versions resolve down to the metro area. So I implemented a subclass of net.nutch.net.RegexURLFilter that does this. (see attached)
The result, IPRegexURLFilter, works as advertised: it filters by regex *and* country-netblock. It's also very, very slow. The reason is quite simple. To do an IP-to-country lookup from a URL, I first have to do a DNS lookup on the hostname, which has high latency. So the single-threaded sections of code that call URLFilter.filter() implementations spend most of their time waiting for the lookup to complete.
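The attached class isn't reproduced here, but the expensive step looks roughly like this (a sketch under assumptions: countryFor is a hypothetical stand-in for the GeoIP country lookup, and error handling is collapsed):

```java
import java.net.InetAddress;
import java.net.URI;

class GeoFilterSketch {
    // Resolves the URL's hostname and returns a country code, or null on
    // failure. The DNS lookup inside getByName() is the blocking call that
    // dominates the filter's running time when called single-threaded.
    static String countryOf(String urlString) {
        try {
            String host = new URI(urlString).getHost();
            InetAddress addr = InetAddress.getByName(host); // blocking DNS lookup
            return countryFor(addr); // GeoIP database query (fast, local)
        } catch (Exception e) {
            return null;
        }
    }

    // Hypothetical stand-in for MaxMind's GeoIP country lookup.
    static String countryFor(InetAddress addr) {
        return addr.isLoopbackAddress() ? "LOCAL" : "??";
    }
}
```

The regex part of the filter is cheap; it's the per-URL DNS round trip that stalls the single-threaded caller.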
My instincts tell me there are two ways to improve this situation:
1) Move the IP-based filter into the multithreaded parts of Fetcher, e.g. FetcherThread
2) Or, push it all the way down to where the Fetcher does its own DNS lookup, so we eliminate the duplicate lookup for each non-filtered URL
(2) would require hooking into each Protocol implementation that deals with hostnames, e.g. protocol-http AND protocol-ftp. That seems like a bad idea. Considering that the JVM will cache DNS requests, perhaps it's not worth going this far to eliminate the double-lookup.
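For reference, the JVM's DNS cache lifetime can be tuned through standard security properties (this is plain JDK behavior, not Nutch-specific; the class name below is hypothetical):

```java
import java.security.Security;

class DnsCacheConfig {
    static void configure() {
        // Cache successful lookups for an hour; the default lifetime
        // is implementation-specific and depends on the security manager.
        Security.setProperty("networkaddress.cache.ttl", "3600");
        // Don't remember failed lookups for long.
        Security.setProperty("networkaddress.cache.negative.ttl", "10");
    }

    static String ttl() {
        return Security.getProperty("networkaddress.cache.ttl");
    }
}
```

With a generous TTL, the filter's lookup and the Fetcher's later lookup for the same host should mostly hit the cache, which supports the "maybe the double lookup isn't worth eliminating" reasoning above.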
So, if (1) is a better course of action, I would need to hook into FetcherThread.run() and run a filter before the call to protocol.getContent(url).
What's the best way to achieve this? More importantly, what's the Nutch way? Since FetcherThread is an inner class, subclassing it isn't the answer. A delegate of some kind seems more appropriate. Perhaps Fetcher could gain a URLFilter ivar which, if not null, FetcherThread calls before protocol.getContent(url)?
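The delegate idea might be sketched like this (hypothetical names throughout; the real FetcherThread is an inner class of net.nutch.fetcher.Fetcher, and protocol.getContent(url) is reduced to a placeholder comment):

```java
// Sketch of the proposed hook: Fetcher holds an optional URLFilter,
// and each worker thread consults it before fetching any content.
class FetcherSketch {
    interface URLFilter {
        String filter(String url); // returns url if accepted, null if not
    }

    private URLFilter urlFilter; // null means "no extra filtering"

    void setUrlFilter(URLFilter f) { this.urlFilter = f; }

    // Stand-in for the per-URL work inside FetcherThread.run().
    boolean fetch(String url) {
        if (urlFilter != null && urlFilter.filter(url) == null) {
            return false; // filtered out before any network I/O
        }
        // ... protocol.getContent(url) would happen here ...
        return true;
    }
}
```

Because the filter runs inside the already-multithreaded fetch loop, the DNS latency of an IP-based filter overlaps with other threads' work instead of serializing the whole pipeline.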
I think this would be a generally useful extension to the crawler, and I'm willing to write it and submit it as a patch.
Nutch committers, what do you think?
(ps: I don't work for MaxMind, I just think their product is useful. The DB access API and GeoIP Free DB are both GPL'd)
--Matt <ipregexurlfilter.java>
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
