Re: document markup to control indexing

Jack Tang Wed, 28 Dec 2005 18:37:48 -0800

Hi,
Sorry, I meant the getTextHelper() method, not getText().

Say I want to index only the content in this block:
<!--indexware-->
This is not Ads
<!--/indexware-->


The code may look like this:

boolean contentStart = false;
boolean contentEnd = false;

if (node.getNodeType() == Node.COMMENT_NODE) {
    // You can move the marker value to your configuration file.
    if ("indexware".equalsIgnoreCase(node.getNodeValue())) {
        // Set your flags here (e.g. contentStart = true).
        return true; // let it go deeper
    }

    .......
    return false;
}

if (contentStart && !contentEnd && node.getNodeType() == Node.TEXT_NODE) {
    // Collect the text between <!--indexware--> and <!--/indexware-->.

}
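
As a rough, self-contained sketch of the same idea (not the actual parse-html code): the class below walks a DOM and keeps only the text found between the two marker comments. The class name IndexwareExtractor and its extract() method are made up for illustration, and it uses the JDK's XML parser instead of the Nutch HTML parser, so the input has to be well-formed XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects text between marker comments.
class IndexwareExtractor {
    // Marker values taken from the example above; they could live in a config file.
    static final String START = "indexware";
    static final String END = "/indexware";

    private boolean inside = false;               // true while between the markers
    private final StringBuilder text = new StringBuilder();

    // Parses well-formed XHTML and returns the text between the markers.
    static String extract(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            IndexwareExtractor extractor = new IndexwareExtractor();
            extractor.walk(doc.getDocumentElement());
            return extractor.text.toString().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
    }

    // Depth-first walk in document order.
    private void walk(Node node) {
        if (node.getNodeType() == Node.COMMENT_NODE) {
            // A comment's node value is its content without the <!-- --> delimiters.
            String value = node.getNodeValue().trim();
            if (START.equalsIgnoreCase(value)) {
                inside = true;
            } else if (END.equalsIgnoreCase(value)) {
                inside = false;
            }
            return; // comment nodes have no children
        }
        if (inside && node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Ads</p>"
            + "<!--indexware-->This is not Ads<!--/indexware-->"
            + "<p>More ads</p></body></html>";
        System.out.println(extract(page));
    }
}
```

A single "inside" flag is used here in place of the separate contentStart/contentEnd flags, since the end marker simply switches collection off again; this also copes with more than one marked block per page.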

/Jack

On 28 Dec 2005 19:26:21 -0000, [EMAIL PROTECTED]
<[EMAIL PROTECTED]> wrote:
> Hi, I asked this question a while back and didn't get a response, so I rolled
> my own parse solution using jericho-html and applying it to the
> HTMLParseFilter extension point.
>
> I just took a look at the getText() method of the DOMContentUtils
> class and I don't see any way to add your own custom tags (or comment tags)
> short of modifying the parse-html code directly and recompiling.
>
> Is that what is meant by adding your own filter?
>
> Thanks in advance for the help,
> -a
>
> --- [email protected] wrote:
> > Hi Jeff
> >
> > Pls refer to getText() method in
> > org.apache.nutch.parse.html.DOMContentUtils class (of course
> > parse-html plugin). You can add your filter easily;)
> >
> > /Jack
> >
> > On 12/27/05, Jeff Breidenbach <[email protected]> wrote:
> > >
> > > Hi all,
> > >
> > > Another open source search engine, HtDig, allows web page authors to
> > > mark up a page such that some sections are not indexed.  The syntax
> > > looks like the following:
> > >
> > > <!--htdig_noindex-->
> > > ... material inside is not indexed ...
> > > <!--/htdig_noindex-->
> > >
> > > Does a similar feature exist in Nutch? If the answer is "write a
> > > plugin" does anyone have tips on where to start? Also, how hard is
> > > something like this for a Nutch newbie who doesn't know anything about
> > > HTML parsing? I have a bunch of documents already marked up with the
> > > htdig syntax, and in the interests of interoperability I'm tempted to
> > > follow the syntax exactly.
> > >
> > > -Jeff
> >
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
