Hi Stefan, I posted this message to the list and then, unfortunately, disappeared from the net. Poor form on my part. (Warning: Real Work can pre-empt my Nutch hacking at any time. :)
Your thinking is very close to mine on this topic. There are at least two logical "filter points" that would be very helpful to have in the Fetcher: (a) immediately before the fetch, for IP-based filters, and (b) immediately after, for content-based filters. So, as you say, I could restrict the crawl to the NYC area and index only restaurant-related pages.

Implementing this with URLFilter is a good start. An ExtensionPoint could be even better if it allows stackable filters with low overhead. My one concern is that filters should not be controlled only through a config file. For example, if I wrote a Java-based scheduler to control several Nutch crawlers (under JMX, perhaps), I'd want to run them inside one JVM. But the current Nutch filters apply globally within the scope of a JVM, so that isn't possible.

Generally, my vision is that Nutch's crawler should be as configurable and modular as Apache httpd. I think these additions would be a few good steps along that path. A subclassable CrawlTool would be another.

What do you think is the next step? Should I simply write an implementation and post it to the list? (Time permitting, of course. :)
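To make the "stackable, instance-scoped" idea concrete, here is a rough, untested sketch of what I mean. None of this is existing Nutch code; FilterChain and its wiring are purely illustrative. The point is that each crawler instance would own its own chain of URLFilters instead of sharing one JVM-global configuration, so two crawlers in the same JVM could run different policies.

  import java.util.ArrayList;
  import java.util.Iterator;
  import java.util.List;

  import net.nutch.net.URLFilter;

  // Hypothetical composite filter: applies each installed URLFilter in
  // order. The first filter to return null rejects the URL outright.
  public class FilterChain implements URLFilter {

    private final List filters = new ArrayList(); // elements are URLFilters

    public void add(URLFilter filter) {
      filters.add(filter);
    }

    public String filter(String urlString) {
      for (Iterator it = filters.iterator(); it.hasNext();) {
        urlString = ((URLFilter) it.next()).filter(urlString);
        if (urlString == null) {
          return null; // rejected; skip the rest of the chain
        }
      }
      return urlString; // survived every filter
    }
  }

A scheduler could then build one chain per crawler (say, an IP/geo filter plus a restaurant-content filter for the NYC crawl) and hand each chain to its own Fetcher, with no global state involved.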
--Matt

On Tue, 7 Dec 2004 01:19:02 +0100, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Matt,
> Very interesting: we are working on a similar issue.
> We use a commercial German "Zip Code to Geo Coordinates" DB.
> We don't have that many URLs, which is why performance hasn't been a
> big problem for us yet; but we are still in development and haven't
> tested that much.
>
> We extract the zip code from a DNS whois lookup. We postpone this
> lookup until indexing, since we index the coordinates as well.
> We cache all of this information in a MySQL database, query that
> database first, and only then do the whois lookup.
> This saves a lot of DNS and whois traffic, though you may not have
> the latest information; in our case that is secondary.
>
> Anyway, I think refactoring the fetch list tool to a multithreaded
> style would be useful in other scenarios as well.
> There must of course be a single writer for the list, but a
> multithreaded filter would be good.
> I would also love to change the interface-based URLFilter to an
> ExtensionPoint-based filter, since that would allow multiple filters
> to be installed (location-based and content-based, e.g. all
> restaurants in NY).
>
> Stefan
>
> On 07.12.2004 at 00:46, Matt Kangas wrote:
>
> > Hi folks,
> >
> > A few weeks ago, I decided to create a Nutch extension that would
> > allow one to crawl URLs only within a certain geographic area. It
> > could be handy for a Canadian to build a Nutch setup that crawls all
> > Canadian sites, including the .coms and .orgs. Or, since I'm in New
> > York, I'd like to search local content in the NYC area without
> > needing the disk space to crawl the entire web.
> >
> > One way to do this is an IP-to-location lookup, using something like
> > the MaxMind.com GeoIP database. The free version resolves to the
> > country level, and the pay versions resolve down to the metro area.
> > So I implemented a subclass of net.nutch.net.RegexURLFilter that does
> > this. (see attached)
> >
> > The result, IPRegexURLFilter, works as advertised: it filters by
> > regex *and* country netblock. It's also very, very slow. The reason
> > is quite simple: to do an IP-to-country lookup from a URL, I first
> > have to do a DNS lookup on the hostname, which has high latency. So
> > the single-threaded sections of code that call URLFilter.filter()
> > implementations spend most of their time waiting for the lookup to
> > complete.
> >
> > My instincts tell me there are two ways to improve this situation:
> > 1) Move the IP-based filter into the multithreaded parts of Fetcher,
> > e.g. FetcherThread.
> > 2) Or, push it all the way down to where the Fetcher does its own DNS
> > lookup, so we eliminate the duplicate lookup for each non-filtered
> > URL.
> >
> > (2) would require hooking into each Protocol implementation that
> > deals with hostnames, e.g. protocol-http AND protocol-ftp. That seems
> > like a bad idea. Considering that the JVM will cache DNS requests,
> > perhaps it's not worth going this far to eliminate the double lookup.
> >
> > So, if (1) is the better course of action, I would need to hook into
> > FetcherThread.run() and run a filter before the call to
> > protocol.getContent(url).
> >
> > What's the best way to achieve this? More importantly, what's the
> > Nutch way? Since FetcherThread is an inner class, subclassing it
> > isn't the answer. A delegate of some kind seems more appropriate.
> > Perhaps Fetcher could gain a URLFilter ivar, which, if not null,
> > FetcherThread calls before protocol.getContent(url)?
> >
> > I think this would be a generally useful extension to the crawler,
> > and I am willing to write it & submit it as a patch.
> >
> > Nutch committers, what do you think?
> >
> > (ps: I don't work for MaxMind, I just think their product is useful.
> > The DB access API and the GeoIP Free DB are both GPL'd.)
> >
> > --Matt
> > <ipregexurlfilter.java>
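PS: To illustrate option (1) from my quoted message, here is a rough, paraphrased sketch of the delegate hook. This is not the actual Nutch Fetcher source; nextUrl() and fetch() are placeholders for the real fetch-list iteration and the protocol.getContent(url) call.

  import net.nutch.net.URLFilter;

  // Sketch only: Fetcher gains a URLFilter ivar that each FetcherThread
  // consults before fetching, so the DNS-bound filter work happens in the
  // already-multithreaded section instead of a single-threaded bottleneck.
  public class Fetcher {

    private URLFilter preFetchFilter; // optional; null means "no filtering"

    public void setPreFetchFilter(URLFilter filter) {
      this.preFetchFilter = filter;
    }

    private class FetcherThread extends Thread {
      public void run() {
        String url;
        while ((url = nextUrl()) != null) {
          if (preFetchFilter != null && preFetchFilter.filter(url) == null) {
            continue; // rejected before any bytes are fetched
          }
          fetch(url); // stands in for protocol.getContent(url) etc.
        }
      }
    }

    // Placeholders for the real fetch-list iteration and protocol dispatch.
    private String nextUrl() { return null; }
    private void fetch(String url) { /* ... */ }
  }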

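PPS: And here is roughly what per-host caching could look like inside an IP-based filter. Again hypothetical: this is not the attached IPRegexURLFilter, and it assumes the MaxMind GeoIP Java API (com.maxmind.geoip.LookupService) plus the URLFilter contract used above (return the URL to keep it, null to drop it). Caching the host-to-country result means each hostname pays the DNS latency only once, which complements Stefan's MySQL cache idea.

  import java.io.IOException;
  import java.net.InetAddress;
  import java.net.URL;
  import java.util.HashMap;
  import java.util.Map;

  import com.maxmind.geoip.LookupService;
  import net.nutch.net.URLFilter;

  // Hypothetical sketch: keep only URLs whose host resolves to a given
  // country. Results are cached per host so repeated URLs on the same
  // host skip the slow DNS + GeoIP lookup.
  public class CountryURLFilter implements URLFilter {

    private final LookupService geoip;
    private final String allowedCountry; // ISO code, e.g. "CA" or "US"
    private final Map countryByHost = new HashMap(); // host -> country code

    public CountryURLFilter(String geoipDbPath, String allowedCountry)
        throws IOException {
      this.geoip = new LookupService(geoipDbPath);
      this.allowedCountry = allowedCountry;
    }

    public String filter(String urlString) {
      try {
        String host = new URL(urlString).getHost();
        String country;
        synchronized (countryByHost) {
          country = (String) countryByHost.get(host);
        }
        if (country == null) {
          // The expensive part: one DNS lookup per previously unseen host.
          String ip = InetAddress.getByName(host).getHostAddress();
          country = geoip.getCountry(ip).getCode();
          synchronized (countryByHost) {
            countryByHost.put(host, country);
          }
        }
        return allowedCountry.equals(country) ? urlString : null;
      } catch (Exception e) {
        return null; // malformed URL or unresolvable host: drop it
      }
    }
  }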