Hi Stefan,

I posted this message to the list and then, unfortunately, disappeared
from the net. Poor form on my part.  (Warning: Real Work can pre-empt
my Nutch hacking at any time. :)

Your thinking is very close to mine on this topic. There are at least
two logical "filter points" that would be very helpful to have in the
Fetcher: (a) immediately before the fetch, for IP-based filters, and
(b) immediately after, for content-based filters. So, as you say, I
could restrict the crawl to the NYC area and index only
restaurant-related pages.
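
Roughly, I'm imagining something like the sketch below. (All names
here are hypothetical; no such interface exists in Nutch today, this
is just to illustrate the two filter points.)

    // Hypothetical sketch of the two filter points; not existing Nutch API.
    public interface FetchFilter {

      /** (a) Runs before the fetch; cheap IP/geo checks go here.
       *  Return false to skip fetching this URL entirely. */
      boolean beforeFetch(String url, java.net.InetAddress resolvedAddress);

      /** (b) Runs after the fetch, once we have the page in hand.
       *  Return false to discard the content before indexing. */
      boolean afterFetch(String url, byte[] content, String contentType);
    }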

Implementing this with UrlFilter is a good start. ExtensionPoint could
be even better if it allows stackable filters w/ low overhead. My one
concern is that filters should not be controlled only through a config
file. For example, if I wrote a Java-based scheduler to control
several Nutch crawlers (under JMX, perhaps), I'd want to run them
inside one JVM. But current Nutch filters apply globally within the
scope of a JVM, so that isn't possible.
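
To make that concrete, here is the kind of per-instance wiring I'd
want to be able to do. (Everything below is hypothetical: the
setURLFilters() setter doesn't exist, Fetcher constructor args are
elided, and IPRegexURLFilter is the class from my earlier mail.)

    // Two crawlers in one JVM, each with its own filter chain:
    java.util.List nycFilters = new java.util.ArrayList();
    nycFilters.add(new IPRegexURLFilter("nyc-netblocks.conf"));

    Fetcher nycFetcher = new Fetcher();
    nycFetcher.setURLFilters(nycFilters);   // hypothetical setter

    java.util.List caFilters = new java.util.ArrayList();
    caFilters.add(new IPRegexURLFilter("canada-netblocks.conf"));

    Fetcher caFetcher = new Fetcher();
    caFetcher.setURLFilters(caFilters);     // no global state involved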

Generally, my vision is that Nutch's crawler should be as
configurable and modular as Apache httpd. I think these additions
would be a few good steps along that path. A subclassable
CrawlTool would be another.

What do you think is the next step? Should I simply write an
implementation and post it to the list?
(Time permitting, of course. :)

--Matt

On Tue, 7 Dec 2004 01:19:02 +0100, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Matt,
> Very interesting; we are working on a similar issue.
> We use a German commercial 'Zip Code to Geo Coordinates' DB.
> We don't have that many URLs, which is why we don't have a big
> performance problem yet. However, we are still in development and
> haven't tested that much.
> 
> We extract the zip code from a DNS whois lookup. We do this lookup
> at indexing time, since we index the coordinates as well.
> Anyway, we cache all the information in a MySQL database; we check
> that database first and only then do the whois query.
> This saves a lot of DNS and whois traffic. You may not have the
> latest information, but in our case that is secondary.
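> 
> The pattern is just a cache-first lookup. A simplified sketch (the
> real cache is a MySQL table, and doWhoisLookup() here is a stand-in
> for the actual whois client):
> 
>     import java.util.HashMap;
>     import java.util.Map;
> 
>     // In-memory stand-in for our MySQL whois cache.
>     public class ZipCodeCache {
>         private final Map cache = new HashMap();   // host -> zip code
> 
>         public synchronized String getZip(String host) {
>             String zip = (String) cache.get(host);
>             if (zip == null) {
>                 zip = doWhoisLookup(host);  // expensive network call
>                 cache.put(host, zip);       // may go stale; acceptable here
>             }
>             return zip;
>         }
> 
>         private String doWhoisLookup(String host) {
>             // placeholder: the real version queries whois and extracts
>             // the registrant's zip code
>             return "10001";
>         }
>     }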
> 
> Anyway, I think refactoring the fetch list tool to a multithreaded
> style would be useful in other scenarios as well.
> There must of course be a single writer for the list, but running
> the filters in multiple threads would be good.
> I would also love to change the interface-based UrlFilter to an
> ExtensionPoint-based filter, since that would allow multiple filters
> to be installed (location-based and content-based, e.g. all
> restaurants in NY).
> 
> Stefan
> 
> On 07.12.2004 at 00:46, Matt Kangas wrote:
> 
> > Hi folks,
> >
> > A few weeks ago, I decided to create a Nutch extension that would
> > allow one to crawl URLs only within a certain geographic area. It
> > could be handy for a Canadian to build a Nutch setup that crawls all
> > Canadian sites, including the .coms and .orgs. Or, since I'm in New
> > York, I'd like to search local content in the NYC area w/o needing the
> > disk space to crawl the entire web.
> >
> > One way to do this is an IP-to-location lookup, using something like
> > the MaxMind.com GeoIP database. The free version resolves to the
> > country level, and the paid versions resolve down to the metro area.
> > So I implemented a subclass of net.nutch.net.RegexURLFilter that does
> > this. (see attached)
> >
> > The result, IPRegexURLFilter, works as advertised: it filters by regex
> > *and* country-netblock. It's also very, very slow. The reason is quite
> > simple. To do an IP-to-country lookup from a URL, I first have to do a
> > DNS lookup on the hostname, which has high latency. So the
> > single-threaded sections of code that call URLFilter.filter()
> > implementations spend most of their time waiting for the lookup to
> > complete.
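> >
> > In sketch form, the hot path looks like this (not the attached code
> > verbatim; isInAllowedNetblock() is a hypothetical helper standing in
> > for the GeoIP check):
> >
> >     private boolean hostAllowed(String urlString) throws Exception {
> >         String host = new java.net.URL(urlString).getHost();
> >         // blocks on a DNS round trip for every candidate URL;
> >         // this is what stalls the single-threaded caller
> >         java.net.InetAddress addr =
> >             java.net.InetAddress.getByName(host);
> >         // the IP-to-country lookup itself (GeoIP) is local and fast
> >         return isInAllowedNetblock(addr);
> >     }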
> >
> > My instincts tell me there are two ways to improve this situation:
> > 1) Move the IP-based filter into the multithreaded parts of Fetcher,
> > e.g. FetcherThread
> > 2) Or, push it all the way down to where the Fetcher does its own DNS
> > lookup, so we eliminate duplicate lookups for each non-filtered URL
> >
> > (2) would require hooking into each Protocol implementation that deals
> > with hostnames, e.g. protocol-http AND protocol-ftp. That seems like a
> > bad idea. Considering that the JVM will cache DNS requests, perhaps
> > it's not worth going this far to eliminate the double-lookup.
> >
> > So, if (1) is a better course of action, I would need to hook into
> > FetcherThread.run() and run a filter before the call to
> > protocol.getContent(url).
> >
> > What's the best way to achieve this? More importantly, what's the
> > Nutch way? Since FetcherThread is an inner class, subclassing it isn't
> > the answer. A delegate of some kind seems more appropriate. Perhaps
> > Fetcher could gain a URLFilter ivar which, if not null, FetcherThread
> > calls before protocol.getContent(url)?
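> >
> > In other words, something like this sketch inside FetcherThread.run()
> > (hypothetical; neither the field nor the check exists today):
> >
> >     private URLFilter preFetchFilter;   // null means no extra filtering
> >
> >     // ...in the fetch loop, just before fetching:
> >     if (preFetchFilter != null && preFetchFilter.filter(url) == null) {
> >         continue;   // filter vetoed this URL; skip the fetch
> >     }
> >     // otherwise proceed as today:
> >     // protocol.getContent(url);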
> >
> > I think this would be a generally useful extension to the crawler, and
> > am willing to write it & submit it as a patch.
> >
> > Nutch committers, what do you think?
> >
> > (ps: I don't work for MaxMind, I just think their product is useful.
> > The DB access API and GeoIP Free DB are both GPL'd)
> >
> > --Matt
> > <ipregexurlfilter.java>
>

