Hi, I asked this question a while back and didn't get a response, so I rolled my own parse solution using jericho-html, applying it at the HTMLParseFilter extension point.
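
For illustration only, the core of stripping htdig-style noindex sections can be sketched roughly as below. This is not the jericho-html solution mentioned above, and it does not use any Nutch API: it is a plain java.util.regex sketch, and the class name NoIndexStripper and the method strip are made up for the example. It assumes the markers appear as HTML comments in the raw page text and are not nested.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or jericho-html.
final class NoIndexStripper {

    // (?i) ignores case, (?s) lets '.' match newlines.
    // The reluctant .*? stops at the first closing marker, so two
    // separate noindex sections are removed independently.
    private static final Pattern NOINDEX = Pattern.compile(
        "(?is)<!--\\s*htdig_noindex\\s*-->.*?<!--\\s*/htdig_noindex\\s*-->");

    private NoIndexStripper() {}

    // Returns the input with every complete noindex section removed.
    // An opening marker without a matching close is left untouched.
    static String strip(String html) {
        return NOINDEX.matcher(html).replaceAll("");
    }
}
```

Calling NoIndexStripper.strip(html) on the raw page text before it is parsed would drop the marked sections. Wiring this into an HTMLParseFilter is left out here, since the exact interface depends on the Nutch version.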
I just took a look at the getText() method of the DOMContentUtils class and I don't see any way to add your own custom tags (or comment tags) short of modifying the parse-html code directly and recompiling. Is that what is meant by adding your own filter?

Thanks in advance for the help,
-a

--- [email protected] wrote:

> Hi Jeff
>
> Pls refer to getText() method in
> org.apache.nutch.parse.html.DOMContentUtils class (of course
> parse-html plugin). You can add your filter easily;)
>
> /Jack
>
> On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
> >
> > Hi all,
> >
> > Another open source search engine, HtDig, allows web page authors to
> > mark up a page such that some sections are not indexed. The syntax
> > looks like the following:
> >
> > <!--htdig_noindex-->
> > ... material inside is not indexed ...
> > <!--/htdig_noindex-->
> >
> > Does a similar feature exist in Nutch? If the answer is "write a
> > plugin" does anyone have tips on where to start? Also, how hard is
> > something like this for a Nutch newbie who doesn't know anything about
> > HTML parsing? I have a bunch of documents already marked up with the
> > htdig syntax, and in the interests of interoperability I'm tempted to
> > follow the syntax exactly.
> >
> > -Jeff
> >
> >
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
>
