Javier P. L. wrote:
> Hi,
>
>
> I need to modify the Nutch Indexer class because I would find it very
> useful to add some fields to the generated Lucene index. I found that
> it is possible to add fields to the Document with doc.addField() in the
> reduce function. The problem is that to compute those fields I need the
> HTML content of the web page, but it does not seem to be present in the
> Document yet: getField("content") throws a null pointer exception.
> Maybe that is not the correct way to access it, or not the correct
> place. So how and where can I access the HTML content of the document
> in order to add a new field to the Lucene Document, and therefore to
> the generated index?
>
> Any advice will be very helpful,
>
>
> Thanks in advance.
>
> Javier.
>
>
>
>
Hi,
You do not need to change the indexer code to add new fields to the
index. Instead, implement an indexing filter and enable it in your
configuration for indexing. For examples, look at the code of the
index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter)
plugins. The IndexingFilter interface has a filter() method that takes
the document, parse, url, CrawlDatum and inlinks as arguments, so the
content of the document to be indexed is readily available there.
The wiki has a tutorial on implementing a plugin.
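
To give a rough idea of the pattern, here is a minimal, self-contained
sketch. It does not use the real Nutch or Lucene classes: Doc, ParsedPage,
Filter, WordCountFilter and the "wordcount" field are all hypothetical
stand-ins that I made up so the example compiles on its own. The real
filter() also receives the CrawlDatum and the inlinks, and its exact
signature depends on your Nutch version, so check the IndexingFilter
source before copying anything.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexingFilterSketch {

    // Hypothetical stand-in for the Lucene Document: field name -> value.
    static class Doc {
        final Map<String, String> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.put(name, value);
        }
    }

    // Hypothetical stand-in for the parse result handed to the filter.
    static class ParsedPage {
        private final String text;

        ParsedPage(String text) {
            this.text = text;
        }

        String getText() {
            return text;
        }
    }

    // Simplified shape of an indexing filter (assumption: the real
    // interface takes more arguments, such as CrawlDatum and inlinks).
    interface Filter {
        Doc filter(Doc doc, ParsedPage parse, String url);
    }

    // Example filter: derives a new field from the parsed page text.
    static class WordCountFilter implements Filter {
        @Override
        public Doc filter(Doc doc, ParsedPage parse, String url) {
            String text = parse.getText().trim();
            int words = text.isEmpty() ? 0 : text.split("\\s+").length;
            doc.add("wordcount", Integer.toString(words));
            return doc;
        }
    }

    // Runs every filter in order over a fresh document, which is roughly
    // what the indexer does with the configured indexing filters.
    static Doc index(ParsedPage parse, String url, Filter... filters) {
        Doc doc = new Doc();
        for (Filter f : filters) {
            doc = f.filter(doc, parse, url);
        }
        return doc;
    }

    public static void main(String[] args) {
        ParsedPage page = new ParsedPage("hello nutch world");
        Doc doc = index(page, "http://example.com/", new WordCountFilter());
        System.out.println(doc.fields);
    }
}
```

The point of the design is that the filter is handed the parse result as
an argument, so it never has to read the page content back out of the
Document with getField("content").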
Best wishes.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers