I think the tutorial at http://wiki.apache.org/nutch/WritingPluginExample shows how to implement the filter: you would filter on the 'content' metatag instead of the 'recommended' one. Which other filters you enable or disable is then up to you. Also look at org.creativecommons.nutch.CCDeleteUnlicensedTool for an example of deleting pages from the index that are missing certain fields.
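To make the "keep only what is inside <content>" step concrete, here is a rough, standalone sketch in plain Java. It does not use any Nutch classes: the class name ContentTagExtractor and the method extract are made up for illustration, and the regex approach is only an assumption about how the tag appears in your pages. In a real plugin this logic would sit inside the parse filter described in the tutorial.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: keeps only the text found
// inside <content>...</content> blocks of an HTML string.
public class ContentTagExtractor {
    // Matches <content ...>...</content>, case-insensitive, across line breaks.
    private static final Pattern CONTENT =
        Pattern.compile("<content\\b[^>]*>(.*?)</content\\s*>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches any remaining markup inside the captured block.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = CONTENT.matcher(html);
        while (m.find()) {
            // Replace nested tags with spaces, then collapse whitespace.
            String text = TAG.matcher(m.group(1)).replaceAll(" ")
                             .replaceAll("\\s+", " ").trim();
            if (text.isEmpty()) continue;
            if (out.length() > 0) out.append(' ');
            out.append(text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><div>menu</div>"
                    + "<content><p>Only this</p> is indexed.</content>"
                    + "<div>footer</div></body></html>";
        System.out.println(extract(page)); // prints: Only this is indexed.
    }
}
```

A regex is good enough for a quick test, but if the pages are messy you would probably want to walk the parsed DOM instead. Pages where extract returns an empty string are the ones you would then drop from the index.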
Rgrds,
Thomas

On 2/13/06, Sunnyvale Fl <[EMAIL PROTECTED]> wrote:
> I'd like to have nutch index only contents within certain metatags;
> essentially, contents that matter would appear inside a <content> tag in
> html format. I am thinking of adding a htmlfilter to filter out the content
> tag, but I would also need to augment the nutch Document to erase everything
> that are non <content> - is that right? thanks!
