[
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107326#comment-14107326
]
Sebastian Nagel commented on NUTCH-1749:
----------------------------------------
Hi, Greg! Indeed, it is sometimes useful not to include the title in the content,
e.g. if title and (short) content are both displayed in search results. However,
- should be made optional by a property "indexer.content.with.title" (or
similar). Otherwise users would need to adapt their search logic if words from
the title are no longer contained in the body.
- should be done for parse-tika as well
- a hard-wired exclusion of <title> elements in method {{getTextHelper}} is not
really transparent, esp. because the method is also used by {{getTitle}} and you
need the construct {{currentNode != node && "title".equalsIgnoreCase(nodeName)}}.
Wouldn't it be much clearer (and more extensible) to add an extra argument with
the excluded tags/elements (filled/set by the calling method)? Roughly:
{code}
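private boolean getTextHelper(..., Set<String> excludedElementNames) {
  ...
  // nodeName should be normalized (e.g. lower-cased) before the lookup,
  // since the patch compared it with equalsIgnoreCase
  if (excludedElementNames.contains(nodeName)) {
    walker.skipChildren();
  }
  ...
}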
{code}
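To make the idea concrete, here is a minimal, self-contained sketch of the "excluded elements" argument. It is not the Nutch code: it uses plain recursion over a JDK-parsed DOM instead of Nutch's NodeWalker, and the class and method names (ExcludedElementsSketch, collectText, getText, parse) are made up for illustration. The caller decides what to exclude, so a title-extracting caller could pass an empty set while the content-extracting caller passes "title".

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class ExcludedElementsSketch {

  /** Appends the text below node to sb, skipping subtrees of excluded elements. */
  public static void collectText(Node node, Set<String> excludedElementNames,
      StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && excludedElementNames.contains(
            node.getNodeName().toLowerCase(Locale.ROOT))) {
      // Hypothetical stand-in for walker.skipChildren(): ignore the whole subtree.
      return;
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
    }
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      collectText(child, excludedElementNames, sb);
    }
  }

  /** Returns the text of the tree; the caller chooses the excluded elements. */
  public static String getText(Node root, Set<String> excludedElementNames) {
    StringBuilder sb = new StringBuilder();
    collectText(root, excludedElementNames, sb);
    return sb.toString();
  }

  /** Parses well-formed XHTML with the JDK's XML parser (assumption: no tag soup). */
  public static Document parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse("<html><head><title>My Title</title></head>"
        + "<body><p>Body text</p></body></html>");
    System.out.println(getText(doc, Set.of()));        // prints "My Title Body text"
    System.out.println(getText(doc, Set.of("title"))); // prints "Body text"
  }
}
```

Whether the set is empty or contains "title" could then be driven by the proposed property, so the default behavior stays unchanged for existing users.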
> Title duplicated in document body
> ---------------------------------
>
> Key: NUTCH-1749
> URL: https://issues.apache.org/jira/browse/NUTCH-1749
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.7
> Reporter: Greg Padiasek
> Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content.
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the
> content via DOMContentUtils.getText(), there is no need to duplicate the
> title in the content. When the title is included in the content, it becomes
> difficult or impossible to extract the document body without the title. This
> matters when a user wants to index or display body and title separately.
> Attached is a patch that prevents the title from being included in the
> document content in the HTML parser plugin.