[jira] [Commented] (NUTCH-1785) Ability to index raw content

Sebastian Nagel (JIRA) Wed, 20 Apr 2016 15:08:36 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250812#comment-15250812
 ]


Sebastian Nagel commented on NUTCH-1785:
----------------------------------------

The class o.a.n.indexer.NutchField supports only a few classes as document 
field values: String, Boolean, Integer, Long, Float, Date.  In addition, the 
IndexWriter implementations (indexer plugins) must support every data type in 
use, or the data must at least provide a usable toString() method.  For 
byte[], toString() does not return a meaningful String (you hardly want to 
index {{[B@13afed55}}).  The conversion via {{new String(bytes)}} isn't stable 
(it decodes with the platform default charset), cf. NUTCH-1807.  However, the 
result is a clean, readable string, though it may not preserve 
bytes/characters from the original.  That's probably the intention.
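
To illustrate the two conversions mentioned above, here is a minimal sketch.  
It is not Nutch code: the class RawBytesDemo and the helper decodeUtf8 are 
hypothetical names, and the sample bytes ("café" encoded as ISO-8859-1) are 
made up for the demonstration.  It assumes the raw content is decoded as 
UTF-8 although it was written in a different charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Nutch.
class RawBytesDemo {

    // Decodes with an explicit charset, so the result does not depend on the
    // platform default. Malformed input is replaced with U+FFFD.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample: "café" in ISO-8859-1; the byte 0xE9 is not valid UTF-8.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Array toString() yields a type tag plus identity hash, like [B@13afed55.
        System.out.println(raw.toString());

        // Readable, but the last character has become the replacement character.
        String decoded = decodeUtf8(raw);
        System.out.println(decoded);

        // Re-encoding does not give the original bytes back.
        byte[] reencoded = decoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(raw, reencoded)); // prints false
    }
}
```

The decoded value is a clean String that any indexer plugin can handle, but 
the original bytes cannot be recovered from it.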

It may be better anyway to preserve the original encoding, esp. for base64, 
where a String representation is defined.  Please open a new issue for your 
problem.  Can you give an example of the charset issue?
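
As a rough sketch of why base64 is the safer case: it maps arbitrary bytes to 
ASCII text and back without loss.  The class Base64FieldDemo and its methods 
are hypothetical names and not the Nutch implementation.  The sketch assumes 
Java 8 or later for java.util.Base64 and reuses a made-up sample ("café" in 
ISO-8859-1).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class, not part of Nutch.
class Base64FieldDemo {

    // Encodes raw bytes as an ASCII-only String that is safe as a field value.
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Restores the exact original bytes from the base64 String.
    static byte[] decode(String fieldValue) {
        return Base64.getDecoder().decode(fieldValue);
    }

    public static void main(String[] args) {
        // Assumed sample: "café" in ISO-8859-1 (bytes 63 61 66 E9).
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String fieldValue = encode(raw);
        System.out.println(fieldValue); // prints Y2Fm6Q==

        // The round trip is lossless regardless of the content's charset.
        System.out.println(Arrays.equals(raw, decode(fieldValue))); // prints true
    }
}
```

The trade-off is that the indexed value is not human-readable, and the 
consumer has to decode it and apply the right charset itself.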

> Ability to index raw content
> ----------------------------
>
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
> NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a 
> configured indexing back-end. Since Content is never read, a plugin is out 
> of the question and therefore we need to force IndexJob to process Content 
> as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
