Andrzej, comments are inline...

On Mon, 17 Jan 2005 13:33:37 +0100, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> While the idea of ContentFilter is very useful, I have some doubts
> regarding the use of URLFilter during fetching. If you don't want to
> fetch some urls, then you should not put them in the fetchlist in the
> first place. In other words, I think this patch should be moved to the
> FetchListTool.java, between lines 508-509.

My original rationale for adding this hook was to support URLFilter implementations that have high latency on the filter() call. The GeoIpFilter I posted in December requires a DNS lookup to retrieve the IP address, then performs an in-memory lookup to see what physical location corresponds to that netblock. Initially I tried plugging it in as urlfilter.class, but the performance was terrible because (a) FetchListTool calls it from single-threaded code, and (b) it is passed, I believe, an unsorted list of URLs.

So, while I agree that URLs should be filtered as early as possible in the pipeline, I thought it was cleaner to add this hook to Fetcher than to add another thread pool to FetchListTool to accommodate URLFilters that are subject to latency.

> Also, in other places we use the factory pattern to get an instance of
> URLFilter, without using setters. Perhaps we should use the same pattern
> here as well?

Sounds good. I was just trying to do the simplest possible thing that would work. :-)

> > This should provide a lot of flexibility for people who don't want to
> > index the entire web. The only drawback I see is that the interface is
> > too simple to be leveraged from the command-line; you'd have to make
> > your own custom CrawlTool and plug in filters at the appropriate point
> > in the crawl cycle.
>
> There is a middle-ground solution here, I think: you could implement a
> simple content filter, which filters e.g. based on a regex match of the
> content metadata. Regexes could be read from a text file. The filter
> could be then activated from the command-line with switch, pointing to
> the location of the regex file.

I think this could be a useful default option for script-oriented folks. But we should make sure that people who want to write Java code can still plug in something more sophisticated (Bayesian classifier, SVM, etc).

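To make the regex-file idea concrete, here is a minimal sketch of how it might look, not a definitive implementation. The ContentFilter interface, its accept() signature, the RegexContentFilter class, and the file format (one regex per line, '#' for comments) are all assumptions made for illustration; none of them come from the patch or from Nutch itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical interface: the real ContentFilter signature may differ.
interface ContentFilter {
    /** Returns true if the fetched page should be kept. */
    boolean accept(String url, Map<String, String> metadata);
}

/** Keeps a page only if some metadata value matches one of the configured regexes. */
final class RegexContentFilter implements ContentFilter {
    private final List<Pattern> patterns = new ArrayList<>();

    /** Compiles every line that is neither blank nor a '#' comment (assumed format). */
    RegexContentFilter(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            patterns.add(Pattern.compile(trimmed));
        }
    }

    /** Loads the regexes from a text file, e.g. one named by a command-line switch. */
    static RegexContentFilter fromFile(Path file) throws IOException {
        return new RegexContentFilter(Files.readAllLines(file));
    }

    @Override
    public boolean accept(String url, Map<String, String> metadata) {
        // The url is unused here; it is part of the interface so that
        // other implementations can take it into account.
        for (String value : metadata.values()) {
            for (Pattern p : patterns) {
                if (p.matcher(value).find()) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A script-oriented user would only edit the regex file, while anyone writing Java could implement the same interface with a classifier behind accept() instead.
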
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
