[
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249465#comment-15249465
]
Federico Bonelli commented on NUTCH-1785:
-----------------------------------------
I'm experiencing charset issues with this patch, probably for the reason given
in Sebastian Nagel's remark:
bq. conversion via {code} new String(content.getContent()) {code} is needless
if base64 is true
I will now try to base64-encode the content.getContent() byte array directly,
but I was wondering about the initial intent behind converting from byte[] to
String and back to byte[] before base64 encoding:
{code:java}
String binary = new String(content.getContent());
// optionally encode as base64
if (base64) {
  binary = Base64.encodeBase64String(StringUtils.getBytesUtf8(binary));
}
{code}
What was the initial intent behind this?
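To illustrate the concern, here is a minimal sketch of how the byte[] to String to byte[] round trip can corrupt content whose bytes are not valid in the decoding charset. It is not taken from the patch: the class and method names are hypothetical, java.util.Base64 stands in for the commons-codec Base64 used above, and UTF-8 is assumed in place of the platform default charset that new String(byte[]) would use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class, not part of Nutch.
class RawContentBase64 {

    // Mirrors the round trip in the quoted snippet:
    // bytes -> String -> UTF-8 bytes -> base64.
    // UTF-8 is an assumption standing in for the platform default charset.
    public static String encodeViaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes the raw bytes directly, with no charset decoding in between.
    public static String encodeDirect(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the trailing byte 0xE9 is not valid UTF-8 on its own.
        byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

        byte[] direct = Base64.getDecoder().decode(encodeDirect(raw));
        byte[] viaString = Base64.getDecoder().decode(encodeViaString(raw));

        System.out.println("direct round trip intact:     " + Arrays.equals(raw, direct));
        System.out.println("via-String round trip intact: " + Arrays.equals(raw, viaString));
    }
}
```

In this sketch the direct path returns the original four bytes, while the via-String path replaces the undecodable byte with U+FFFD and re-encodes it as three UTF-8 bytes, so the indexed payload no longer matches the fetched content.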
> Ability to index raw content
> ----------------------------
>
> Key: NUTCH-1785
> URL: https://issues.apache.org/jira/browse/NUTCH-1785
> Project: Nutch
> Issue Type: New Feature
> Components: indexer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch,
> NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured
> indexing back-end. Since Content is never read, a plugin is out of the
> question and therefore we need to force IndexJob to process Content as well.