Hi Jeff

Please refer to the getText() method in the
org.apache.nutch.parse.html.DOMContentUtils class (part of the
parse-html plugin). You can easily add your filter there ;)
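
To give a rough idea of what such a filter could look like, here is a
minimal sketch. It is not the actual Nutch code: the class
NoIndexTextExtractor and its methods are made up for illustration, and
it uses only the standard org.w3c.dom API on well-formed markup. It
also assumes the HTML parser keeps comment nodes in the DOM, which you
would need to check for the parser that parse-html is configured with.
The idea is to toggle a "skipping" flag on the htdig comment markers
while walking the tree and to drop text nodes while the flag is set.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text from a DOM tree
// while skipping everything between the htdig noindex comment markers.
class NoIndexTextExtractor {
    static final String START = "htdig_noindex";
    static final String END = "/htdig_noindex";

    // True while we are between a START and an END comment.
    private boolean skipping = false;

    // Depth-first walk in document order, so the flag set by a START
    // comment also applies to text nested in later sibling elements.
    private void collect(Node node, StringBuilder sb) {
        switch (node.getNodeType()) {
            case Node.COMMENT_NODE: {
                String marker = node.getNodeValue().trim();
                if (START.equals(marker)) {
                    skipping = true;
                } else if (END.equals(marker)) {
                    skipping = false;
                }
                break;
            }
            case Node.TEXT_NODE: {
                if (!skipping) {
                    String text = node.getNodeValue().trim();
                    if (!text.isEmpty()) {
                        if (sb.length() > 0) {
                            sb.append(' ');
                        }
                        sb.append(text);
                    }
                }
                break;
            }
            default: {
                for (Node child = node.getFirstChild(); child != null;
                        child = child.getNextSibling()) {
                    collect(child, sb);
                }
            }
        }
    }

    // Assumption: the input is well-formed XHTML, so the JDK's XML parser
    // can stand in for whatever HTML parser produces the DOM in Nutch.
    static String extract(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder sb = new StringBuilder();
        new NoIndexTextExtractor().collect(doc, sb);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>keep</p>"
                + "<!--htdig_noindex--><p>drop</p><!--/htdig_noindex-->"
                + "<p>also keep</p></body></html>";
        System.out.println(extract(page));
    }
}
```

In getText() the same flag check would go wherever text nodes are
appended to the output buffer, but the exact place depends on the
Nutch version you are running.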

/Jack

On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
>
> Hi all,
>
> Another open source search engine, HtDig, allows web page authors to
> mark up a page such that some sections are not indexed.  The syntax
> looks like the following:
>
> <!--htdig_noindex-->
> ... material inside is not indexed ...
> <!--/htdig_noindex-->
>
> Does a similar feature exist in Nutch? If the answer is "write a
> plugin" does anyone have tips on where to start? Also, how hard is
> something like this for a Nutch newbie who doesn't know anything about
> HTML parsing? I have a bunch of documents already marked up with the
> htdig syntax, and in the interests of interoperability I'm tempted to
> follow the syntax exactly.
>
> -Jeff
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
