Hi, I asked this question a while back and didn't get a response, so I rolled
my own parse solution using jericho-html and applying it to the
HTMLParseFilter extension point.

I just took a look at the getText() method of the DOMContentUtils
class and I don't see any way to add your own custom tags (or comment
tags) short of modifying the parse-html code directly and recompiling.

Is that what is meant by adding your own filter?
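
For reference, the behaviour I am after can be sketched in plain Java
without touching any Nutch classes. This is only a rough illustration:
the class name NoIndexStripper and its strip() method are made up for
this mail, it works on the raw HTML string rather than on the DOM that
getText() walks, and how it would be wired into a parse filter is not
shown.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or jericho-html.
final class NoIndexStripper {

    // Matches one htdig-style excluded region, markers included.
    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at
    // the first closing marker so separate regions are not merged.
    private static final Pattern NOINDEX = Pattern.compile(
            "<!--\\s*htdig_noindex\\s*-->.*?<!--\\s*/htdig_noindex\\s*-->",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the HTML with every complete noindex region removed.
    // An opening marker with no closing marker is left untouched.
    static String strip(String html) {
        return NOINDEX.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<p>keep</p><!--htdig_noindex--><p>skip</p>"
                + "<!--/htdig_noindex--><p>also keep</p>";
        System.out.println(strip(page)); // prints <p>keep</p><p>also keep</p>
    }
}
```

Stripping the regions before the text is extracted keeps the htdig
syntax intact for pages that are already marked up, at the cost of a
regex pass over every document.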

Thanks in advance for the help,
-a

--- [email protected] wrote:
> Hi Jeff
>
> Pls refer to getText() method in
> org.apache.nutch.parse.html.DOMContentUtils class (of course
> parse-html plugin). You can add your filter easily;)
>
> /Jack
> 
> On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
> >
> > Hi all,
> >
> > Another open source search engine, HtDig, allows web page authors to
> > mark up a page such that some sections are not indexed.  The syntax
> > looks like the following:
> >
> > <!--htdig_noindex-->
> > ... material inside is not indexed ...
> > <!--/htdig_noindex-->
> >
> > Does a similar feature exist in Nutch? If the answer is "write a
> > plugin", does anyone have tips on where to start? Also, how hard is
> > something like this for a Nutch newbie who doesn't know anything about
> > HTML parsing? I have a bunch of documents already marked up with the
> > htdig syntax, and in the interests of interoperability I'm tempted to
> > follow the syntax exactly.
> >
> > -Jeff
> >
> 
> 
> --
> Keep Discovering ... ...
> http://www.jroller.com/page/jmars
> 
