Felix Zimmermann wrote:
Hi,

I use the latest Nutch trunk with "solrindex" (Nutch for crawling and Solr
for searching). My question is: how can I store the native content of
HTML pages, including all tags, in e.g. the Solr field "caching"? While
indexing, the field remains empty; all other fields like "title" or
"content" work well.

Currently this is not possible out of the box; it would require some changes to the indexer. Namely, Content would have to be added as one of the indexer's inputs and passed along in NutchDocument, which currently handles only String values, whereas Content keeps its payload as byte[]. The raw content would then either have to be turned into a String, or passed as is, assuming you have added a BinaryFieldType extension to your Solr ...

So, it's possible to do it but it's not a simple config switch.
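
To illustrate just the "turn the raw content into a String" step, here is a minimal sketch in plain Java. It does not touch the Nutch indexer or NutchDocument at all; the class name RawContentDecoder, the decode method, and the UTF-8 fallback are assumptions made for this example, not existing Nutch code. It only shows how a byte[] payload could be decoded once you know (or guess) the page's charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: converts a raw byte[] payload
// into a String so it could be stored in a String-only document field.
class RawContentDecoder {

    // charsetName is whatever encoding was detected for the page
    // (for example from the HTTP headers); it may be null or unknown.
    static String decode(byte[] raw, String charsetName) {
        if (raw == null) {
            return "";
        }
        // Assumed fallback when no usable charset name is given.
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the fallback.
            }
        }
        return new String(raw, charset);
    }
}
```

A call such as RawContentDecoder.decode(bytes, "ISO-8859-1") would return the page markup, tags included, as a String. Decoding with the wrong charset silently garbles non-ASCII characters, which is one reason passing the bytes through unchanged to a binary field may be the safer route.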

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
