[jira] [Commented] (NUTCH-2032) Plugin to index the raw content of a readable document.

Sebastian Nagel (JIRA) Wed, 03 Jun 2015 14:39:13 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571694#comment-14571694
 ]


Sebastian Nagel commented on NUTCH-2032:
----------------------------------------

Hi [~betolink], your solution/patch already adds the raw HTML to the parse 
metadata. This requires no changes to the indexer (an advantage), at the price 
of storing the raw content twice in the segment. Since we can change the 
indexer, the solution from NUTCH-1785 seems favourable?

> Plugin to index the raw content of a readable document. 
> --------------------------------------------------------
>
>                 Key: NUTCH-2032
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2032
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: content, index, index-rawcontent, parser, raw
>             Fix For: 1.11
>
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
> https://issues.apache.org/jira/browse/NUTCH-1458
> We created a couple of plugins to index the raw content of readable 
> documents. If we include these plugins in the plugin chain, we'll index the 
> raw content of a readable document, e.g. XML, HTML, CSV, TXT. The 
> index-rawcontent plugin is not designed to index binary files; however, 
> having the full content of an HTML/XML or CSV document is critical for some 
> of us.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
