Hi
Sorry, I meant the getTextHelper() method, not getText().
Say I want to index the content in this block:
<!--indexware-->
This is not Ads
<!--/indexware-->
The code might look like this:
// Keep these flags as fields so they persist across recursive calls.
boolean contentStart = false;
boolean contentEnd = false;
if (node.getNodeType() == Node.COMMENT_NODE) {
  // You can move the marker value to your configuration file.
  if ("indexware".equalsIgnoreCase(node.getNodeValue())) {
    // Set your flags here.
    return true; // let it go deep
  }
  .......
  return false;
}
if (contentStart && !contentEnd && node.getNodeType() == Node.TEXT_NODE) {
  // Get the text between <!--indexware--> and <!--/indexware-->.
}
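
For anyone who wants to try the idea outside Nutch first, here is a minimal standalone sketch of the same flag-toggling walk. It is not the parse-html code: the class name MarkerTextExtractor and its methods are made up for illustration, it uses the JDK's own XML parser instead of the Nutch HTML parser, and it assumes the input is well-formed markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: collects only the text that sits
// between <!--indexware--> and <!--/indexware--> comment markers.
final class MarkerTextExtractor {
    // Assumed marker names; in practice these would come from configuration.
    private static final String START = "indexware";
    private static final String END = "/indexware";

    private boolean inside = false;              // true while between the markers
    private final StringBuilder sb = new StringBuilder();

    static String extract(Node root) {
        MarkerTextExtractor extractor = new MarkerTextExtractor();
        extractor.walk(root);
        return extractor.sb.toString();
    }

    // Depth-first walk in document order; comment nodes toggle the flag.
    private void walk(Node node) {
        short type = node.getNodeType();
        if (type == Node.COMMENT_NODE) {
            String value = node.getNodeValue().trim();
            if (START.equalsIgnoreCase(value)) {
                inside = true;
            } else if (END.equalsIgnoreCase(value)) {
                inside = false;
            }
            return; // comments have no children
        }
        if (type == Node.TEXT_NODE && inside) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child);
        }
    }

    // Parses well-formed markup with the JDK parser, which keeps comments by default.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Ads here</p><!--indexware--><p>This is not Ads</p>"
                + "<!--/indexware--><p>More ads</p></body></html>";
        System.out.println(extract(parse(page)));
    }
}
```

Inverting the check (collect text only while the flag is false) should give the htdig_noindex style of exclusion instead.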
/Jack
On 28 Dec 2005 19:26:21 -0000, [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> Hi, I asked this question a while back and didn't get a response, so I rolled
> my own parse solution using jericho-html and applying it to the
> HTMLParseFilter extension point.
>
> I just took a look at the getText() method of the DOMContentUtils
> class and I don't see any way to add your own custom tags (or comment
> tags) short of modifying the parse-html code directly and recompiling.
>
> Is that what is meant by adding your own filter?
>
> Thanks in advance for the help,
> -a
>
> --- [email protected] wrote:
> > Hi Jeff
> >
> > Pls refer to getText() method in
> > org.apache.nutch.parse.html.DOMContentUtils class (of course
> > parse-html plugin). You can add your filter easily;)
> >
> > /Jack
> >
> > On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Another open source search engine, HtDig, allows web page authors to
> > > mark up a page such that some sections are not indexed. The syntax
> > > looks like the following:
> > >
> > > <!--htdig_noindex-->
> > > ... material inside is not indexed ...
> > > <!--/htdig_noindex-->
> > >
> > > Does a similar feature exist in Nutch? If the answer is "write a
> > > plugin" does anyone have tips on where to start? Also, how hard is
> > > something like this for a Nutch newbie who doesn't know anything about
> > > HTML parsing? I have a bunch of documents already marked up with the
> > > htdig syntax, and in the interests of interoperability I'm tempted to
> > > follow the syntax exactly.
> > >
> > > -Jeff
> > >
> >
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars