[ https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573148#comment-14573148 ]
Luis Lopez commented on NUTCH-2032:
-----------------------------------
Hi [~wastl-nagel], could you elaborate on what seems favourable? Yes, this
will increase the size of the segments, which is non-trivial. I think that
this plugin approach is less intrusive with respect to the current class
signatures. It works well for our use case, in which we no longer need the
segments once they are indexed.
> Plugin to index the raw content of a readable document.
> --------------------------------------------------------
>
> Key: NUTCH-2032
> URL: https://issues.apache.org/jira/browse/NUTCH-2032
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Affects Versions: 1.10
> Reporter: Luis Lopez
> Labels: content, index, index-rawcontent, parser, raw
> Fix For: 1.11
>
>
> This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and
> https://issues.apache.org/jira/browse/NUTCH-1458
> We created a couple of plugins to index the raw content of readable documents.
> If we include these plugins in the plugin chain, we'll index the raw content
> of a readable document, e.g. XML, HTML, CSV, TXT, etc. The index-rawcontent
> plugin is not designed to index binary files; however, having the full content
> of an HTML, XML, or CSV document is critical for some of us.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)