[ https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1749:
-----------------------------------
Fix Version/s: 1.17 (was: 1.16)
> Optionally exclude title from content field
> -------------------------------------------
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Priority: Major
> Fix For: 1.17
>
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content.
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
> content via DOMContentUtils.getText(), there is no need to duplicate the
> title in the content. When the title is included in the content, it becomes
> difficult or impossible to extract the document body without it. This need
> arises when a user wants to index or display the body and the title
> separately.
> The attached patch prevents the HTML parser plugin from including the title
> in the document content.
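
The behaviour described above can be illustrated with a minimal, standalone sketch. It is not the attached DOMContentUtils.patch and does not reproduce Nutch's own DOMContentUtils API; the class TitleExclusionSketch, its methods, and the excludeTitle flag are hypothetical names chosen for illustration. It walks a DOM tree built with the JDK's XML parser and skips the <title> subtree when asked to, assuming well-formed XHTML input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not Nutch code and not the attached patch.
public class TitleExclusionSketch {

    /** Parses a well-formed XHTML string with the JDK's XML parser. */
    public static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    /** Collects the text under root, optionally leaving out the title element. */
    public static String getText(Node root, boolean excludeTitle) {
        StringBuilder sb = new StringBuilder();
        collect(root, excludeTitle, sb);
        return sb.toString();
    }

    private static void collect(Node node, boolean excludeTitle, StringBuilder sb) {
        if (excludeTitle
                && node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return; // skip the whole <title> subtree
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' '); // separate text fragments with one space
                }
                sb.append(text);
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), excludeTitle, sb);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>My Title</title></head>"
                + "<body><p>Body text.</p></body></html>");
        System.out.println(getText(doc, false)); // title and body together
        System.out.println(getText(doc, true));  // body only
    }
}
```

With the flag off, the title text appears at the start of the extracted content, which is the duplication the report describes; with it on, only the body text remains, so title and body can be indexed or displayed separately.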
--
This message was sent by Atlassian Jira
(v8.3.4#803005)