Re: [Nutch-dev] Modifiying Nutch Indexer

Javier P. L. Thu, 09 Nov 2006 02:33:16 -0800

On Tue, 07-11-2006 at 15:01 +0200, Enis Soztutar wrote:
> Javier P. L. wrote:
> > Hi, 
> >
> >
> > I need to modify the Nutch Indexer class because it would be very
> > useful for me to add some fields to the generated Lucene index. While
> > experimenting, I found that it is possible to add fields to the Document
> > with doc.addField() in the reduce function. The problem is that for those
> > fields I need the HTML content of the web page to process it, but it does
> > not seem to be present in the Document yet, because getField("content")
> > throws a null pointer exception. Maybe that is not the correct way to
> > access it, or not the correct place. So how and where can I access the
> > HTML content of the document, so that I can add a new field to the Lucene
> > Document and hence to the generated index?
> >
> > Any advice will be very helpful, 
> >
> >
> > Thanks in advance. 
> >
> > Javier.
> >
> >
> >
> >   
> Hi,
> 
> You do not need to change the indexer code to add new fields to the 
> index. You need to implement an indexing filter and add it to your 
> configuration during indexing. You can look at the code of 
> index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter). 
> The IndexingFilter interface has a filter() method which takes the 
> document, parse, url, CrawlDatum and inlinks as arguments, so you 
> readily have the content of the document to be indexed.
> 
> You can look at the tutorial on implementing a plugin on the wiki.
> 
> Best wishes.
> 
>


Thanks for the help. I did what you said, but now I have a question:
where can I extract the HTML code of the document from, i.e. the
equivalent of bean.getContent(details)? I need it for the new fields
that I want to add in the index plugin. I tried Parse and CrawlDatum,
but the most I got was the parsed text of the HTML code. Does anyone
know how to get the HTML itself?


Thanks in advance,

Javier
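
For illustration, the indexing-filter approach Enis describes could be
sketched roughly as below. This is only a minimal, self-contained sketch:
Doc, ParsedPage, FieldFilter and WordCountFilter are hypothetical stand-ins
invented for the example, not Nutch or Lucene classes, and the "wordCount"
field is an arbitrary choice. A real plugin would implement Nutch's
IndexingFilter interface and its filter() method instead.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexingFilterSketch {

    /** Hypothetical stand-in for a Lucene Document: field name -> value. */
    static class Doc {
        final Map<String, String> fields = new HashMap<>();
        void add(String name, String value) { fields.put(name, value); }
        String get(String name) { return fields.get(name); }
    }

    /** Hypothetical stand-in for the parse result handed to a filter. */
    static class ParsedPage {
        final String text;
        ParsedPage(String text) { this.text = text; }
    }

    /** Hypothetical stand-in for the filter contract (not Nutch's interface). */
    interface FieldFilter {
        Doc filter(Doc doc, ParsedPage parse, String url);
    }

    /** Adds a "wordCount" field computed from the parsed text. */
    static class WordCountFilter implements FieldFilter {
        public Doc filter(Doc doc, ParsedPage parse, String url) {
            String text = parse.text.trim();
            int words = text.isEmpty() ? 0 : text.split("\\s+").length;
            doc.add("wordCount", Integer.toString(words));
            return doc; // the same document, now carrying the extra field
        }
    }

    /** Builds a document for one page and runs the filter over it. */
    static Doc index(String url, String parsedText) {
        Doc doc = new Doc();
        doc.add("url", url);
        return new WordCountFilter().filter(doc, new ParsedPage(parsedText), url);
    }

    public static void main(String[] args) {
        Doc doc = index("http://example.com/", "hello nutch indexing filter");
        System.out.println(doc.get("wordCount")); // prints 4
    }
}
```

Note that the sketch only works from the parsed text, which matches what
the message above reports seeing inside the filter; it does not show how
to reach the raw HTML.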

