Luis Lopez created NUTCH-2032:
---------------------------------
Summary: Plugin to index the raw content of a readable document.
Key: NUTCH-2032
URL: https://issues.apache.org/jira/browse/NUTCH-2032
Project: Nutch
Issue Type: New Feature
Components: indexer, parser
Affects Versions: 1.10
Reporter: Luis Lopez
Fix For: 1.11
This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and
https://issues.apache.org/jira/browse/NUTCH-1458
We created a couple of plugins to index the raw content of readable documents.
When these plugins are included in the plugin chain, the raw content of a
readable document (XML, HTML, CSV, TXT, etc.) is indexed. The index-rawcontent
plugin is not designed to index binary files. However, having the full content
of an HTML/XML or CSV document is critical for some of us.
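
To illustrate the readable-versus-binary distinction described above, here is a
minimal Java sketch. It is not the code of the index-rawcontent plugin and does
not use the Nutch API. The class name RawContentSketch, the method names, the
list of content types, and the UTF-8 decoding are all assumptions made for this
example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

public class RawContentSketch {

    // Hypothetical allow-list: non-"text/*" types that are still human-readable.
    private static final Set<String> READABLE_TYPES =
            Set.of("application/xml", "application/xhtml+xml");

    // Returns true for readable types such as HTML, XML, CSV, and plain text.
    public static boolean isReadable(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return base.startsWith("text/") || READABLE_TYPES.contains(base);
    }

    // Returns the raw content as a string for readable documents and an
    // empty Optional for anything else (for example, binary files).
    // Assumption: the bytes are UTF-8; a real plugin would need to honor
    // the charset detected for the document.
    public static Optional<String> rawContent(byte[] content, String contentType) {
        if (content == null || !isReadable(contentType)) {
            return Optional.empty();
        }
        return Optional.of(new String(content, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\n1,nutch\n".getBytes(StandardCharsets.UTF_8);
        // A CSV document is readable, so its raw content is kept.
        System.out.println(rawContent(csv, "text/csv; charset=UTF-8").isPresent());
        // A PDF is binary, so it is skipped.
        System.out.println(rawContent(new byte[] {0x25, 0x50}, "application/pdf").isPresent());
    }
}
```

In a setup like this, the returned string would be what ends up in a raw
content field of the index, while binary documents would be left without one.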
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)