You can do this by modifying the parse-html plugin. You'll see that HtmlParser calls DOMContentUtils to extract the text from the page. Change DOMContentUtils.getText() so that it skips any content you don't want indexed.
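
As a rough sketch of the idea (not the actual Nutch code), a recursive DOM walk can simply refuse to descend into elements you want to leave out. Everything named here is an assumption for illustration: the class FilteredTextExtractor, the "noindex" class-attribute convention, and the extract() helper are all made up, and the real getText() in DOMContentUtils may have a different signature and whitespace handling. The sketch also parses well-formed XHTML with the JDK's XML parser just so it runs on its own; inside the plugin you would already have a DOM node to start from.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; not part of Nutch.
class FilteredTextExtractor {

    // Assumed convention: elements whose class attribute contains this
    // token are excluded from the extracted text.
    static final String SKIP_CLASS = "noindex";

    // Appends the text under `node` to `sb`, skipping excluded subtrees.
    static void getText(StringBuilder sb, Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getNodeName();
            // Never index script or style bodies.
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
            // Do not descend into subtrees marked for exclusion.
            if (hasClass(el, SKIP_CLASS)) {
                return;
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            getText(sb, children.item(i));
        }
    }

    // True if the element's class attribute contains the given token.
    static boolean hasClass(Element el, String cls) {
        for (String token : el.getAttribute("class").split("\\s+")) {
            if (token.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    // Stand-alone driver: parses well-formed XHTML and returns the kept text.
    static String extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
            StringBuilder sb = new StringBuilder();
            getText(sb, doc.getDocumentElement());
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }
}
```

With this in place, marking a block such as a navigation div with class="noindex" would keep its text out of what gets indexed, while the rest of the page is extracted as before. Note that this only affects the extracted text; outlink extraction is a separate step and would need its own change if links in those blocks should not be followed.
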

Andy

On 5/23/05, Ashit Patel <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I would like to direct Nutch to exclude parts of a
> page from crawling & indexing. Is there a way to do so
> using special tags/configuration?
>
> Thanks,
> Ashit
>
