Javier P. L. wrote:
Hi,

I need to modify the Nutch Indexer class because I want to add some
fields to the generated Lucene index. I found that it is possible to
add fields to the Document with doc.addField() in the reduce function.
However, to compute those fields I need the HTML content of the web
page, and it does not seem to be present in the Document yet:
getField("content") throws a null pointer exception. Maybe that is not
the correct way to access it, or not the correct place. So how and
where can I access the HTML content of the document, so that I can add
a new field to the Lucene Document and thus to the generated index?

Any advice will be very helpful,

Thanks in advance.
Javier.



Hi,

You do not need to change the indexer code to add new fields to the index. Instead, implement an indexing filter and enable it in your configuration for indexing. The code of index-basic (BasicIndexingFilter) and index-more (MoreIndexingFilter) is a good starting point. The IndexingFilter interface has a filter() method that takes the document, parse, url, CrawlDatum and inlinks as arguments, so the content of the document being indexed is readily available there.
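
To make the idea concrete, here is a minimal sketch of what such a filter could look like. It is not real Nutch code: Document, Parse, CrawlDatum, Inlinks and IndexingFilter below are simplified stand-ins written only so the example compiles on its own, and the exact types and signatures in your Nutch version may differ. WordCountIndexingFilter and the "wordcount" field are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins (ASSUMPTIONS, not the real Nutch/Lucene classes),
// declared here only so that the sketch is self-contained.
class Document {
    private final Map<String, String> fields = new LinkedHashMap<>();
    void add(String name, String value) { fields.put(name, value); }
    String get(String name) { return fields.get(name); }
}

interface Parse {
    // Stand-in for the parsed text of the page.
    String getText();
}

class CrawlDatum { }

class Inlinks { }

// Stand-in with the argument list described above:
// document, parse, url, CrawlDatum and inlinks.
interface IndexingFilter {
    Document filter(Document doc, Parse parse, String url,
                    CrawlDatum datum, Inlinks inlinks);
}

// Hypothetical filter: derives a new field from the page content
// and adds it to the document that will be indexed.
class WordCountIndexingFilter implements IndexingFilter {
    @Override
    public Document filter(Document doc, Parse parse, String url,
                           CrawlDatum datum, Inlinks inlinks) {
        String text = parse.getText();
        // Count whitespace-separated tokens; empty or missing text counts as 0.
        int words = (text == null || text.trim().isEmpty())
                ? 0
                : text.trim().split("\\s+").length;
        doc.add("wordcount", Integer.toString(words));
        // Return the document so that the next filter in the chain can extend it.
        return doc;
    }
}
```

The point of the design is that the content arrives as an argument of filter(), so there is no need to read it back from the Document with getField("content").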

The wiki also has a tutorial on implementing a plugin.

Best wishes.

