[
https://issues.apache.org/jira/browse/NUTCH-1944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332417#comment-14332417
]
Sebastian Nagel commented on NUTCH-1944:
----------------------------------------
This issue duplicates NUTCH-1785, but the solution proposed here, an
IndexingFilter plugin, works only for 2.x. On 1.x an indexing filter cannot
request the raw content from segments; NUTCH-1785 addresses this by
implementing the functionality in the indexer core. However, an IndexingFilter
seems to be the simpler and more modular solution.
The conversion from raw content to a string is done implicitly and relies on
the system's locale (cf. NUTCH-1807), so the result can differ between
machines. The encoding used to represent the HTML as a string should be
predictable, as [discussed in
NUTCH-1785|https://issues.apache.org/jira/browse/NUTCH-1785?focusedCommentId=14011649&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-140116499].
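To illustrate the encoding concern, here is a minimal sketch in plain Java. It is not code from the patch or from Nutch; the class name RawContentDecoding and the helper decode are hypothetical. It contrasts decoding raw bytes with the platform default charset against decoding with an explicitly chosen one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RawContentDecoding {

    // Hypothetical helper (not part of Nutch): decode raw bytes with an
    // explicitly chosen charset so the result does not depend on the machine.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Raw page content as it might be stored, here encoded as UTF-8.
        byte[] raw = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);

        // Implicit conversion: uses the JVM's default charset, which may
        // follow the system locale and therefore vary between hosts.
        String implicit = new String(raw);

        // Explicit conversion: always decodes as UTF-8.
        String explicit = decode(raw, StandardCharsets.UTF_8);

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("implicit: " + implicit);
        System.out.println("explicit: " + explicit);
    }
}
```

In a real filter the charset would presumably come from the detected or declared encoding of the page rather than being hard-coded to UTF-8; that choice is an assumption made only to keep the sketch short.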
> Add raw content to indexes
> --------------------------
>
> Key: NUTCH-1944
> URL: https://issues.apache.org/jira/browse/NUTCH-1944
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, plugin
> Reporter: Lewis John McGibbney
> Fix For: 2.4
>
>
> The issue is described very well here
> https://github.com/Meabed/nutch2-index-html