Look at the source code for the basic indexing plugin (index-basic). It
indexes the title tag and some other tags, so it should be a good
starting point.
Eric
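A minimal sketch of the kind of logic such a plugin could use to leave out a <div>. It assumes the plugin has the parsed page available as a W3C DOM tree (worth checking against the extension points in your Nutch version). DivSkippingText and its methods are hypothetical names, not Nutch APIs, and only JDK classes are used:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
class DivSkippingText {

    /** Collects the text under node, skipping any <div> whose id equals skipId. */
    static String extract(Node node, String skipId) {
        StringBuilder out = new StringBuilder();
        walk(node, skipId, out);
        // Collapse the runs of whitespace left behind by the markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    private static void walk(Node node, String skipId, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            // Some HTML DOMs report tag names in upper case, so ignore case.
            if ("div".equalsIgnoreCase(el.getNodeName())
                    && skipId.equals(el.getAttribute("id"))) {
                return; // drop this whole subtree
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), skipId, out);
        }
    }

    /** For trying it out: parses well-formed XHTML with the JDK's own parser. */
    static String extractFromXhtml(String xhtml, String skipId) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            return extract(doc, skipId);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }
}
```

Calling DivSkippingText.extractFromXhtml(page, "bla bla") on a page would return its text without the contents of that <div>. Real intranet pages are rarely well-formed XHTML, so the JDK parser here is only a stand-in for whatever HTML parser produces the DOM.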
On Oct 5, 2009, at 1:20 PM, BELLINI ADAM wrote:
Hi,
But how do I get at the HTML <div> tag?
Is there a Nutch method for pulling a <div> tag out of the content?
Thanks
Subject: Re: indexing just certain content
From: e...@lakemeadonline.com
Date: Mon, 5 Oct 2009 13:09:17 -0700
To: nutch-user@lucene.apache.org
Adam,
You could turn off all the indexing plugins and write your own plugin
that only indexes certain meta content from your intranet, giving you
complete control of the fields indexed.
Eric
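As a rough illustration of indexing only certain meta content, the field selection itself can be as small as the sketch below. Plain maps stand in for the page metadata and the index document. MetaFieldSelector is a hypothetical name and none of this is the Nutch plugin API:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Nutch.
class MetaFieldSelector {

    /** Returns only the entries of pageMeta whose lower-cased names are in wantedFields. */
    static Map<String, String> select(Map<String, String> pageMeta,
                                      Set<String> wantedFields) {
        Map<String, String> indexed = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : pageMeta.entrySet()) {
            // Meta names show up with inconsistent case, so normalise first.
            String name = e.getKey().toLowerCase(Locale.ROOT);
            if (wantedFields.contains(name)) {
                indexed.put(name, e.getValue());
            }
        }
        return indexed;
    }
}
```

Anything not named in wantedFields never reaches the index, which is the "complete control" part. The names in wantedFields are expected in lower case.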
On Oct 5, 2009, at 1:06 PM, BELLINI ADAM wrote:
Hi,
Does anybody know if it's possible to index just certain content? I
need to avoid indexing some garbage and repetitive data on my
intranet.
In other words, is it possible to tell the indexer not to index the
content between certain <div> tags, like:
<div id="bla bla">
please don't index this bla bla bla
</div>
Thanks to all