> Stefan and/or Doug,
>
> Here's a followup to my Jan 3 diff. This time I added two hooks to the Fetcher: one for URLFilter, and one for a new interface, ContentFilter. These allow one to:
>
> - filter out URLs prior to fetching, and
> - filter out fetched content prior to writing to a segment.
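For concreteness, a minimal sketch of what such a ContentFilter hook could look like. The interface name comes from the description above, but the method signature, the Properties-based metadata, and the example implementation are all assumptions for illustration, not the actual patch:

```java
import java.util.Properties;

// Hypothetical filter interface: returning null drops the fetched
// content before it is written to the segment.
interface ContentFilter {

  /**
   * Inspect fetched content and its metadata; return the content
   * (possibly modified) to keep it, or null to discard it.
   */
  byte[] filter(String url, byte[] content, Properties metadata);
}

// Trivial example implementation: keep only text/html responses.
class HtmlOnlyFilter implements ContentFilter {
  public byte[] filter(String url, byte[] content, Properties metadata) {
    String type = metadata.getProperty("Content-Type", "");
    return type.startsWith("text/html") ? content : null;
  }
}
```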
While the idea of a ContentFilter is very useful, I have some doubts about using a URLFilter during fetching. If you don't want to fetch some URLs, you shouldn't put them in the fetchlist in the first place. In other words, I think this part of the patch should be moved to FetchListTool.java, between lines 508-509.
Also, in other places we use the factory pattern to get an instance of URLFilter, without using setters. Perhaps we should use the same pattern here as well?
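The factory approach could be sketched roughly like this. This is not Nutch's actual factory code; the property key, class names, and reflection details are assumptions for illustration:

```java
import java.util.Properties;

// Minimal filter interface, repeated here so the sketch is self-contained.
interface ContentFilter {
  byte[] filter(String url, byte[] content, Properties metadata);
}

// Example implementation that keeps everything unchanged.
class NoOpFilter implements ContentFilter {
  public byte[] filter(String url, byte[] content, Properties metadata) {
    return content;
  }
}

// Sketch of a reflection-based factory, in the spirit of looking up a
// filter implementation from configuration instead of wiring it in
// through setters.
class ContentFilterFactory {
  // Hypothetical property naming the implementation class.
  static final String KEY = "fetcher.content.filter.class";

  static ContentFilter getFilter(Properties conf) {
    String className = conf.getProperty(KEY);
    if (className == null) return null;  // no filter configured
    try {
      return (ContentFilter) Class.forName(className)
          .getDeclaredConstructor().newInstance();
    } catch (Exception e) {
      throw new RuntimeException("cannot instantiate " + className, e);
    }
  }
}
```

Looking the implementation up from configuration keeps the Fetcher decoupled from any particular filter, which would match the existing factory usage mentioned above.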
> This should provide a lot of flexibility for people who don't want to index the entire web. The only drawback I see is that the interface is too simple to be leveraged from the command line; you'd have to make your own custom CrawlTool and plug in filters at the appropriate point in the crawl cycle.
There is a middle-ground solution here, I think: you could implement a simple content filter which filters based on, e.g., a regex match against the content metadata. The regexes could be read from a text file, and the filter could then be activated from the command line with a switch pointing to the location of the regex file.
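A sketch of that middle-ground filter, under the stated assumptions. The class name, the one-regex-per-line file format, and the rule of matching any metadata value are all illustrative choices, not part of the patch:

```java
import java.io.*;
import java.util.*;
import java.util.regex.Pattern;

// Sketch of a regex-driven content filter: exclusion patterns are read
// from a plain text file, one regex per line ('#' lines are comments),
// and content whose metadata matches any pattern is discarded.
class RegexContentFilter {
  private final List<Pattern> patterns = new ArrayList<>();

  RegexContentFilter(Reader regexFile) {
    try (BufferedReader in = new BufferedReader(regexFile)) {
      String line;
      while ((line = in.readLine()) != null) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) continue;
        patterns.add(Pattern.compile(line));
      }
    } catch (IOException e) {
      throw new RuntimeException("cannot read regex file", e);
    }
  }

  /** Return true if the content should be kept (no pattern matched). */
  boolean accept(Properties metadata) {
    for (Pattern p : patterns) {
      for (String key : metadata.stringPropertyNames()) {
        if (p.matcher(metadata.getProperty(key)).find()) {
          return false;  // matched an exclusion pattern -> drop
        }
      }
    }
    return true;
  }
}
```

From the command line it could then be enabled with a switch such as `-contentFilter regex-file.txt` (a hypothetical option name) pointing at the regex file.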
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
