[Nutch-general] Re: How do I exclude portions of the HTML content from being indexed

Andy Liu Tue, 24 May 2005 09:36:53 -0700

You can do this by modifying the parse-html plugin.  HtmlParser calls
DOMContentUtils to extract the text from the page.  Change getText()
there so that it skips any content you don't want indexed.
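
As a rough illustration of the kind of change meant here, the sketch below
walks a DOM tree and skips unwanted subtrees while collecting text.  It is a
stand-alone example built on the standard org.w3c.dom API, not the actual
DOMContentUtils code.  The class name FilteredTextExtractor, the extract()
helper, the skipTags parameter and the class="noindex" marker are all
assumptions made up for the example, and the real getText() signature in
parse-html may differ.

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone sketch; this is NOT Nutch's DOMContentUtils.
class FilteredTextExtractor {

    // Hypothetical marker: elements carrying class="noindex" are skipped.
    static final String SKIP_CLASS = "noindex";

    /** Appends the text found under node to sb, skipping excluded subtrees. */
    static void getText(StringBuilder sb, Node node, Set<String> skipTags) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            String name = el.getTagName().toLowerCase(Locale.ROOT);
            // Returning here drops the element and everything beneath it.
            if (skipTags.contains(name) || hasClass(el, SKIP_CLASS)) {
                return;
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            getText(sb, children.item(i), skipTags);
        }
    }

    /** True if the element's class attribute contains cls as a whole token. */
    static boolean hasClass(Element el, String cls) {
        for (String c : el.getAttribute("class").split("\\s+")) {
            if (c.equals(cls)) {
                return true;
            }
        }
        return false;
    }

    /** Convenience wrapper for the example: parses well-formed XHTML only. */
    static String extract(String xhtml, Set<String> skipTags) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
            StringBuilder sb = new StringBuilder();
            getText(sb, root, skipTags);
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }
}
```

With this sketch, calling extract() on
<html><body><div class="noindex">menu</div><p>Hello <b>world</b></p></body></html>
with an empty skipTags set returns "Hello world".  The same early-return
check could be placed wherever the plugin's own tree walk visits each node.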


Andy

On 5/23/05, Ashit Patel <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> I would like to direct Nutch to exclude parts of a
> page from crawling & indexing. Is there a way to do so
> using special tags/configuration?
> 
> Thanks,
> Ashit
>

