[
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250812#comment-15250812
]
Sebastian Nagel commented on NUTCH-1785:
----------------------------------------
The class o.a.n.indexer.NutchField supports only a few classes as
document field values: String, Boolean, Integer, Long, Float, Date. In addition,
IndexWriter implementations (indexer plugins) must support all data types in
use, or the data must provide a meaningful toString() method. For byte[],
toString() does not return a meaningful String (you hardly want to index
{{[B@13afed55}}).
The conversion via {{new String(bytes)}} isn't stable (it decodes with the
platform default charset), cf. NUTCH-1807.
However, the result is a clean, readable string, though it may not preserve
bytes/characters from the original. That's probably the intention.
Still, it may be better to preserve the original encoding, esp. for base64,
where a String representation is defined. Please open a new issue for your
problem. Can you give an example of the charset issue?
> Ability to index raw content
> ----------------------------
>
> Key: NUTCH-1785
> URL: https://issues.apache.org/jira/browse/NUTCH-1785
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch,
> NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured
> indexing back-end. Since Content is never read, a plugin is out of the
> question and therefore we need to force IndexJob to process Content as well.