Look at the source code for the basic indexing plugin. It indexes the title tag and a few other fields, so it should be a good starting point.
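
For the <div> part of the question, here is a minimal sketch of one way to drop a block before its text is indexed. It is plain Java and does not call any Nutch API: the class name DivStripper and the method stripDiv are hypothetical, and wiring it into a parse or indexing plugin is not shown. It tracks nesting with a regular expression rather than a real HTML parser, so it assumes reasonably well-formed markup and removes only the first <div> with the given id.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: removes one <div id="..."> block from raw HTML. */
final class DivStripper {

    // Matches any opening or closing div tag; group 1 is "/" for a closing tag.
    private static final Pattern DIV_TAG =
        Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    private DivStripper() {}

    /** Returns html without the first div whose id attribute equals id. */
    static String stripDiv(String html, String id) {
        // Opening tag of the div to drop, e.g. <div id="bla bla">
        Pattern open = Pattern.compile(
            "<div\\b[^>]*\\sid\\s*=\\s*([\"'])" + Pattern.quote(id) + "\\1[^>]*>",
            Pattern.CASE_INSENSITIVE);
        Matcher start = open.matcher(html);
        if (!start.find()) {
            return html; // no such div, nothing to remove
        }

        // Walk the div tags that follow, tracking nesting depth,
        // until the div we opened is closed again.
        Matcher tags = DIV_TAG.matcher(html);
        int depth = 1;
        int pos = start.end();
        while (depth > 0 && tags.find(pos)) {
            depth += tags.group(1).isEmpty() ? 1 : -1;
            pos = tags.end();
        }

        if (depth > 0) {
            // The div is never closed: drop everything from its opening tag on.
            return html.substring(0, start.start());
        }
        return html.substring(0, start.start()) + html.substring(pos);
    }
}
```

Calling DivStripper.stripDiv(html, "bla bla") on a page would return the markup with that block cut out, and the remaining text could then be handed to whatever builds the indexed content field.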

Eric

On Oct 5, 2009, at 1:20 PM, BELLINI ADAM wrote:


Hi,

But how will I get the HTML <div> tag?
Is there any Nutch method to get the <div> tag from the content?
Thanks




Subject: Re: indexing just certain content
From: e...@lakemeadonline.com
Date: Mon, 5 Oct 2009 13:09:17 -0700
To: nutch-user@lucene.apache.org

Adam,

You could turn off all the indexing plugins and write your own plugin
that only indexes certain meta content from your intranet - giving you
complete control of the fields indexed.

Eric

On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:


Hi,

Does anybody know if it's possible to index just certain content? I
need to avoid indexing some garbage and repetitive data on my
intranet.

In other words, is it possible to tell the indexer not to index the
content between certain <div> tags, like:

<div id="bla bla">


plz dont index this  bla  bla bla

</div>

Thanks to all
                                        