Felix Zimmermann wrote:
> Hi,
> I use the latest Nutch trunk with "solrindex" (Nutch for crawling and Solr
> for searching). My question is: how can I store the native content of
> HTML pages, including all tags, in e.g. the Solr field "caching"? While
> indexing, the field remains empty; all other fields like "title" or
> "content" work well.
Currently this is not possible out of the box; it would require some
changes to the indexer. Namely, the Content would have to be added as
one of the indexer's inputs, and we would have to pass it in NutchDocument
(which currently handles only String values, while Content uses byte[] for
its payload). Then this raw content would have to be either turned into a
String, or passed as is, assuming you have added a BinaryFieldType extension
to your Solr ...
So it's possible to do, but it's not a simple config switch.
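
To illustrate only the "turn the raw content into a String" step, here is a
minimal sketch. It does not use the Nutch API: the class RawContentSketch, the
methods decode and addRawField, and the Map standing in for the document's
String fields are all hypothetical names for illustration, and the fallback to
UTF-8 when the charset is unknown is an assumption, not something Nutch does
for you.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class RawContentSketch {

    // Decode the raw page bytes into a String.
    // Assumption: fall back to UTF-8 when the charset name is missing,
    // malformed, or not supported by this JVM.
    public static String decode(byte[] raw, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName);
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: keep the UTF-8 default.
            }
        }
        return new String(raw, charset);
    }

    // Hypothetical stand-in for adding a String field to the document
    // that is sent to Solr; "caching" is the field name from the question.
    public static Map<String, String> addRawField(
            Map<String, String> doc, byte[] raw, String charsetName) {
        doc.put("caching", decode(raw, charsetName));
        return doc;
    }

    public static void main(String[] args) {
        byte[] raw = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "Hello");
        addRawField(doc, raw, "UTF-8");
        System.out.println(doc.get("caching"));
    }
}
```

In a real indexer the charset would have to come from the page's metadata or
from detection; decoding with the wrong charset silently corrupts non-ASCII
text, which is one reason passing the bytes through as a binary field can be
the safer route.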
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com