Andrzej, comments are inline...

On Mon, 17 Jan 2005 13:33:37 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> While the idea of ContentFilter is very useful, I have some doubts
> regarding the use of URLFilter during fetching. If you don't want to
> fetch some urls, then you should not put them in the fetchlist in the
> first place. In other words, I think this patch should be moved to the
> FetchListTool.java, between lines 508-509.

My original rationale for adding this hook was to support URLFilter
implementations that have high latency on the filter() call. The
GeoIpFilter I posted in December requires a call to DNS to retrieve
the IP address, then performs a local-memory lookup to see what
physical location corresponds to that netblock. Initially I tried
plugging it in as urlfilter.class, but the performance was terrible
due to (a) FetchListTool calling it in single-threaded code, and (b)
being passed, I believe, a non-sorted list of URLs.

So, while I agree that URLs should be filtered as early as possible in
the pipeline, I thought it was cleaner to add this hook to Fetcher
than to add another thread pool to FetchListTool to accommodate
URLFilters that are subject to latency.
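
For illustration only, here is a rough sketch of why running a slow
filter on several threads helps. It is not the code from the patch:
the SlowUrlFilter interface, the filterAll helper and the class name
are all made up for this example, and the only assumption carried
over is that a filter returns the URL to keep it or null to drop it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelUrlFilterSketch {

    /** Hypothetical stand-in for a URL filter: return the URL to keep it, null to drop it. */
    interface SlowUrlFilter {
        String filter(String url);
    }

    /** Applies the filter to every URL on a fixed-size pool; input order is preserved. */
    static List<String> filterAll(List<String> urls, SlowUrlFilter filter, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit everything first so slow calls (e.g. DNS lookups) overlap.
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> filter.filter(url);
                pending.add(pool.submit(task));
            }
            // Collect results in submission order, skipping rejected URLs.
            List<String> kept = new ArrayList<>();
            for (Future<String> future : pending) {
                String result = future.get();
                if (result != null) {
                    kept.add(result);
                }
            }
            return kept;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://a.example/", "http://b.test/", "http://c.example/");
        // Toy filter: keep only hosts containing ".example".
        System.out.println(filterAll(urls, u -> u.contains(".example") ? u : null, 2));
    }
}
```

Fetcher already runs many threads, so hooking the filter in there gets
this overlap for free; FetchListTool would need a pool like the one
above.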

> Also, in other places we use the factory pattern to get an instance of
> URLFilter, without using setters. Perhaps we should use the same pattern
> here as well?

Sounds good. I was just trying to do the simplest possible thing that
would work. :-)

> > This should provide a lot of flexibility for people who don't want to
> > index the entire web. The only drawback I see is that the interface is
> > too simple to be leveraged from the command-line; you'd have to make
> > your own custom CrawlTool and plug in filters at the appropriate point
> > in the crawl cycle.
> 
> There is a middle-ground solution here, I think: you could implement a
> simple content filter, which filters e.g. based on a regex match of the
> content metadata. Regexes could be read from a text file. The filter
> could be then activated from the command-line with switch, pointing to
> the location of the regex file.

I think this could be a useful default option for script-oriented
folks. But we should make sure that people who want to write Java
code can plug in something more sophisticated (Bayesian classifier,
SVM, etc).
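
To make the two options concrete, here is a hedged sketch of how a
regex-based default could sit behind the same plug-in point as a
custom classifier. The ContentFilter interface shown here, its
accept() method and the metadata-as-Map shape are assumptions made up
for this example, not the interface from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class RegexContentFilterSketch {

    /** Hypothetical plug-in point: any implementation decides whether to keep a page. */
    interface ContentFilter {
        boolean accept(Map<String, String> metadata);
    }

    /** Default for script-oriented use: keep a page if one metadata value matches any regex. */
    static final class RegexContentFilter implements ContentFilter {
        private final String key;
        private final List<Pattern> patterns = new ArrayList<>();

        /**
         * @param key        metadata field to test (hypothetical, e.g. "Content-Type")
         * @param regexLines one regex per line; blank lines and lines starting with '#' are skipped
         */
        RegexContentFilter(String key, List<String> regexLines) {
            this.key = key;
            for (String line : regexLines) {
                String trimmed = line.trim();
                if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                    continue;
                }
                patterns.add(Pattern.compile(trimmed));
            }
        }

        @Override
        public boolean accept(Map<String, String> metadata) {
            String value = metadata.get(key);
            if (value == null) {
                return false; // no such field: nothing to match against
            }
            for (Pattern pattern : patterns) {
                if (pattern.matcher(value).find()) {
                    return true;
                }
            }
            return false;
        }
    }
}
```

The regex lines could come from a text file named on the command line
(for instance via java.nio.file.Files.readAllLines), while a Bayesian
or SVM filter would simply be another ContentFilter implementation.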


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers