Re: Storing full HTML with nutch/solrindexer.

Andrzej Bialecki Mon, 09 Feb 2009 08:37:00 -0800

Felix Zimmermann wrote:

Hi,

I use the latest Nutch trunk with "solrindex" (Nutch for crawling and Solr
for searching). My question is: how can I store the native content of
HTML pages, including all tags, in e.g. the Solr field "caching"? While
indexing, the field remains empty; all other fields like "title" or
"content" work well.

Currently this is not possible out of the box, it would require some
changes to the indexer. Namely, the Content would have to be added as
one of the inputs, and we would have to pass it in NutchDocument (which
currently handles only String values, while Content uses byte[] for
payload). Then this raw content would have to be turned into a String,
or passed as is assuming you have added a BinaryFieldType extension to
your Solr ...


So, it's possible to do it but it's not a simple config switch.
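
To illustrate the "turned into a String" step only, here is a minimal
sketch in plain Java. It is not Nutch code: the class RawContentDecoder
and its decode method are hypothetical names, and the charset name is
assumed to be supplied by the caller (for example from the page's
Content-Type header). Wiring the result into the indexer and
NutchDocument would still need the changes described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch or Solr.
final class RawContentDecoder {

    private RawContentDecoder() {
    }

    /**
     * Decodes a raw byte[] payload into a String.
     *
     * @param raw         the raw page bytes (null is treated as empty)
     * @param charsetName the charset reported for the page, may be null
     * @return the decoded text; UTF-8 is assumed when the charset name
     *         is missing, malformed, or unsupported on this JVM
     */
    static String decode(byte[] raw, String charsetName) {
        if (raw == null) {
            return "";
        }
        Charset charset = StandardCharsets.UTF_8; // assumed default
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName.trim());
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unknown or malformed name: keep the UTF-8 default.
            }
        }
        // Bytes that are invalid in the chosen charset are replaced with
        // the charset's default replacement character, not rejected.
        return new String(raw, charset);
    }
}
```

A call such as RawContentDecoder.decode(bytes, "ISO-8859-1") would then
yield the page source, tags included, as a String value. Falling back to
UTF-8 is a design choice for the sketch; pages in another encoding with
no declared charset would be decoded incorrectly.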

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
