[ 
https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573148#comment-14573148
 ] 

Luis Lopez commented on NUTCH-2032:
-----------------------------------

Hi [~wastl-nagel], could you elaborate on what seems favourable? Yes, this will
increase the size of the segments, which is non-trivial. I think this plugin
approach is less intrusive to the current class signatures. It works well for
our use case, in which we don't need the segments once they are indexed.

> Plugin to index the raw content of a readable document. 
> --------------------------------------------------------
>
>                 Key: NUTCH-2032
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2032
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, parser
>    Affects Versions: 1.10
>            Reporter: Luis Lopez
>              Labels: content, index, index-rawcontent, parser, raw
>             Fix For: 1.11
>
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and
> https://issues.apache.org/jira/browse/NUTCH-1458.
> We created a couple of plugins to index the raw content of readable
> documents. If we include these plugins in the plugin chain, we'll index the
> raw content of a readable document, e.g. XML, HTML, CSV or TXT. The
> index-rawcontent plugin is not designed to index binary files; however,
> having the full content of an HTML/XML or CSV document is really critical
> for some of us.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
