Hi Jeff,

Please refer to the getText() method in the org.apache.nutch.parse.html.DOMContentUtils class (part of the parse-html plugin). You can add your filter there easily ;)
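
To give a rough idea of what such a filter could look like, here is a standalone sketch, not Nutch's actual code. It walks a DOM in document order and drops any text found between the htdig_noindex comment markers. The class NoIndexTextExtractor and its methods are hypothetical names made up for illustration, and it uses only the JDK's XML parser on well-formed XHTML. The parse-html plugin builds its DOM with its own HTML parser, so whether comment nodes survive into that DOM is an assumption you would need to check.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of Nutch.
class NoIndexTextExtractor {
    // Comment bodies that open and close a non-indexed region (htdig syntax).
    private static final String START = "htdig_noindex";
    private static final String END = "/htdig_noindex";

    private boolean skipping = false;
    private final StringBuilder out = new StringBuilder();

    // Parses well-formed XHTML and returns its text, minus the noindex regions.
    static String extractText(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
        NoIndexTextExtractor extractor = new NoIndexTextExtractor();
        extractor.walk(doc.getDocumentElement());
        return extractor.out.toString();
    }

    // Depth-first walk in document order, so a marker comment affects
    // every node that follows it until the closing marker is seen.
    private void walk(Node node) {
        short type = node.getNodeType();
        if (type == Node.COMMENT_NODE) {
            String marker = node.getNodeValue().trim();
            if (START.equals(marker)) {
                skipping = true;
            } else if (END.equals(marker)) {
                skipping = false;
            }
        } else if (type == Node.TEXT_NODE && !skipping) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            walk(child);
        }
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>keep this</p>"
            + "<!--htdig_noindex--><p>skip this</p><!--/htdig_noindex-->"
            + "<p>and this</p></body></html>";
        System.out.println(extractText(page));
    }
}
```

The skip flag lives on the extractor rather than being passed down the recursion, because the opening and closing markers are usually siblings of the content they enclose, not its ancestors. The same flag-toggling idea should carry over to wherever getText() visits nodes, assuming it sees comment nodes at all.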

/Jack

On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
>
> Hi all,
>
> Another open source search engine, HtDig, allows web page authors to
> mark up a page such that some sections are not indexed. The syntax
> looks like the following:
>
> <!--htdig_noindex-->
> ... material inside is not indexed ...
> <!--/htdig_noindex-->
>
> Does a similar feature exist in Nutch? If the answer is "write a
> plugin" does anyone have tips on where to start? Also, how hard is
> something like this for a Nutch newbie who doesn't know anything about
> HTML parsing? I have a bunch of documents already marked up with the
> htdig syntax, and in the interests of interoperability I'm tempted to
> follow the syntax exactly.
>
> -Jeff
>

--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
