Would anyone be able to help me with this or provide a sample of how to do this kind of thing?
On Mon, 18 Apr 2005 10:38:11 -0700 Doug Cutting <[EMAIL PROTECTED]> wrote:
You could implement an HTML filter plugin that looks for these strings in the title and, when they're found, throws a ParseException to abort the page.
It would look something like the following (untested) code:
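  import org.apache.oro.text.regex.*;
  import org.w3c.dom.DocumentFragment;
  // Nutch imports (HtmlParseFilter, Parse, ParseException, Content, NutchConf)
  // are omitted here; their package names depend on the Nutch version.

  public class TitleRegexFilter implements HtmlParseFilter {
    private Pattern pattern;        // null when no regex is configured
    private Perl5Matcher matcher;

    public TitleRegexFilter() {
      String regex = NutchConf.get().get("html.exclude.title.regex", "");
      this.matcher = new Perl5Matcher();
      try {
        // An empty regex matches every title, so treat it as "filter disabled".
        this.pattern = regex.length() == 0 ? null : new Perl5Compiler().compile(regex);
      } catch (MalformedPatternException e) {
        throw new RuntimeException("Bad html.exclude.title.regex: " + regex);
      }
    }

    // Synchronized because a Perl5Matcher instance is not thread-safe.
    public synchronized Parse filter(Content content, Parse parse,
                                     DocumentFragment doc)
      throws ParseException {
      String title = parse.getData().getTitle();
      if (pattern != null && matcher.contains(pattern, title)) {
        throw new ParseException("Title rejected: " + title);
      }
      return parse;                 // title is acceptable, pass the parse through
    }
  }

You'd need to put this under src/plugin/parse-title-regex-filter, with a build.xml and a plugin.xml, add a value for html.exclude.title.regex in the config file, etc. Does this make sense?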
Doug
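
To try out a candidate value for html.exclude.title.regex before wiring up the plugin, the matching step can be exercised on its own. The following is only a minimal sketch and not Nutch code: it swaps in java.util.regex for the ORO Perl5 classes used above, and the class name TitleRegexSketch, the method isRejected, and the sample regex are all hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the title check; not part of Nutch.
class TitleRegexSketch {

    // Returns true when the title should be rejected.
    // An empty or missing regex disables the check, since an empty
    // pattern would otherwise match every title.
    static boolean isRejected(String regex, String title) {
        if (regex == null || regex.isEmpty() || title == null) {
            return false;
        }
        // find() searches for the pattern anywhere in the title,
        // which mirrors what Perl5Matcher.contains() does.
        return Pattern.compile(regex).matcher(title).find();
    }

    public static void main(String[] args) {
        // Example value only; the real one would come from the config file.
        String regex = "(?i)page not found|access denied";
        System.out.println(isRejected(regex, "404 Page Not Found"));
        System.out.println(isRejected(regex, "Welcome to Nutch"));
        System.out.println(isRejected("", "Anything"));
    }
}
```

Run as written, this prints true, false, false: the first title matches case-insensitively, the second does not match, and the empty regex leaves the check switched off. Perl5 and java.util.regex syntax overlap for simple patterns like this one, but they are not identical, so a regex that passes here should still be checked against ORO.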
