Greg Padiasek created NUTCH-1749:
------------------------------------
Summary: Title duplicated in document body
Key: NUTCH-1749
URL: https://issues.apache.org/jira/browse/NUTCH-1749
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.7
Reporter: Greg Padiasek
The HTML parser plugin inserts the document title into the document content.
Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
content via DOMContentUtils.getText(), there is no need to duplicate the title
in the content. When the title is included in the content, it becomes difficult
or impossible to extract the document body without it. This matters when a user
wants to index or display the body and the title separately.
Attached is a patch that prevents the HTML parser plugin from including the
title in the document content.
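
To illustrate the behaviour described above, here is a minimal, standalone
sketch of text extraction over a DOM tree with and without the title subtree.
It is not the attached patch and not Nutch's DOMContentUtils code: the class
TitleFreeText, its methods, and the skipTitle flag are hypothetical names used
only for this example, and it assumes well-formed XHTML input so that the
JDK's built-in XML parser can be used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration class; not part of Nutch.
class TitleFreeText {

    // Parse a well-formed XHTML string into a DOM tree (assumption: input is valid XML).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Collect all text under root, separated by single spaces.
    // When skipTitle is true, the <title> element and its text are left out.
    static String text(Node root, boolean skipTitle) {
        StringBuilder sb = new StringBuilder();
        collect(root, skipTitle, sb);
        return sb.toString();
    }

    private static void collect(Node node, boolean skipTitle, StringBuilder sb) {
        // Skip the whole <title> subtree so the title does not leak into the body text.
        if (skipTitle && node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(t);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, skipTitle, sb);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>My Title</title></head>"
                + "<body><p>Body text.</p></body></html>");
        System.out.println(text(doc, false)); // prints "My Title Body text."
        System.out.println(text(doc, true));  // prints "Body text."
    }
}
```

With the title subtree skipped, the title and the body can be stored or
displayed as separate fields without any post-processing of the content.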