[EMAIL PROTECTED] wrote:
Would anyone be able to help me with this or provide a
sample of how to do this kind of thing?

On Mon, 18 Apr 2005 10:38:11 -0700
 Doug Cutting <[EMAIL PROTECTED]> wrote:
You could implement an HTML filter plugin that looks for
these strings in the title and, when they're found,
throws a ParseException to abort the page.

It would look something like the following (untested) code:

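import org.apache.oro.text.regex.MalformedPatternException;
import org.apache.oro.text.regex.Pattern;
import org.apache.oro.text.regex.Perl5Compiler;
import org.apache.oro.text.regex.Perl5Matcher;
import org.w3c.dom.DocumentFragment;
// Nutch imports (HtmlParseFilter, Parse, ParseException, Content, NutchConf)
// are omitted: their package names depend on the Nutch version.

public class TitleRegexFilter implements HtmlParseFilter {
  private Pattern pattern;   // stays null when no regex is configured
  private Perl5Matcher matcher = new Perl5Matcher();

  public TitleRegexFilter() {
    String regex = NutchConf.get().get("html.exclude.title.regex", "");
    if (regex.length() == 0) return;   // an empty pattern would match every title
    try {
      this.pattern = new Perl5Compiler().compile(regex);
    } catch (MalformedPatternException e) {
      throw new RuntimeException("Bad html.exclude.title.regex: " + regex, e);
    }
  }

  // synchronized because a Perl5Matcher instance is not thread-safe
  public synchronized Parse filter(Content content, Parse parse,
                                   DocumentFragment doc)
    throws ParseException {
    String title = parse.getData().getTitle();
    // ORO's contains() takes the input first, then the pattern
    if (pattern != null && title != null && matcher.contains(title, pattern)) {
      throw new ParseException("Title rejected: " + title);
    }
    return parse;   // title is acceptable; pass the parse through unchanged
  }
}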

You'd need to put this under
src/plugin/parse-title-regex-filter, with a build.xml and a
plugin.xml, add a value for html.exclude.title.regex in the
config file, etc. Does this make sense?
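
To try the matching logic on its own before wiring up a plugin, here is a minimal standalone sketch. It assumes java.util.regex in place of the ORO classes used above and has no Nutch dependencies; the class name TitleRegexCheck and its method are hypothetical, and any regex passed to it is only an illustration, not a recommended value for html.exclude.title.regex.

```java
import java.util.regex.Pattern;

// Standalone sketch of the title check. It swaps ORO for java.util.regex
// and does not touch Nutch. Class and method names are hypothetical.
final class TitleRegexCheck {
  private final Pattern pattern;  // null means "no exclusion configured"

  TitleRegexCheck(String regex) {
    // Treat a missing or empty regex as "exclude nothing"; compiling ""
    // would produce a pattern that matches every title.
    this.pattern = (regex == null || regex.isEmpty())
        ? null
        : Pattern.compile(regex);
  }

  // Returns true when the title contains a match for the exclusion regex.
  boolean isRejected(String title) {
    if (pattern == null || title == null) {
      return false;
    }
    // find() looks for a match anywhere in the title, like ORO's contains().
    return pattern.matcher(title).find();
  }
}
```

A compiled java.util.regex.Pattern can be shared between threads and a fresh Matcher is created on each call, so this version needs no synchronization. In a real filter, isRejected() returning true would be the point at which the ParseException is thrown.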

Doug
