[jira] [Commented] (NUTCH-1749) Title duplicated in document body

Sebastian Nagel (JIRA) Fri, 22 Aug 2014 12:18:33 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107326#comment-14107326
 ]


Sebastian Nagel commented on NUTCH-1749:
----------------------------------------

Hi, Greg! Indeed, it may sometimes be useful not to include the title in the 
content, e.g. if title and (short) content are displayed in search results. However,
- this should be made optional by a property "indexer.content.with.title" (or 
similar). Otherwise users would need to adapt their search logic if words from 
the title are no longer contained in the body.
- the same should be done for parse-tika as well
- a hard-wired exclusion of <title> elements in method {{getTextHelper}} is not 
really transparent, esp. because it is also used by {{getTitle}} and you need 
the construct {{currentNode != node && "title".equalsIgnoreCase(nodeName)}}. 
Wouldn't it be much clearer (and more extensible) to add an extra argument with 
excluded tags/elements (filled/set by the calling method)? Roughly:
{code}
private boolean getTextHelper(..., Set<String> excludedElementNames) {
  ...
  // element names in the set are expected in lower case
  if (excludedElementNames.contains(nodeName.toLowerCase())) {
    walker.skipChildren();
  }
  ...
}
{code}
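
To make the idea a bit more concrete, here is a minimal, self-contained sketch. 
It is not the Nutch code: it uses the JDK DOM API and plain recursion instead of 
{{NodeWalker}}, and the names {{ExcludedElementsSketch}}, {{collectText}} and 
the boolean {{contentWithTitle}} (standing in for the proposed property) are 
made up for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class ExcludedElementsSketch {

  /** Appends the text below node to sb, skipping subtrees of excluded elements. */
  static void collectText(Node node, Set<String> excludedElementNames,
      StringBuilder sb) {
    if (node.getNodeType() == Node.ELEMENT_NODE && excludedElementNames
        .contains(node.getNodeName().toLowerCase(Locale.ROOT))) {
      return; // plays the role of walker.skipChildren()
    }
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
    }
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      collectText(child, excludedElementNames, sb);
    }
  }

  /** The caller decides what to exclude; the helper has nothing hard-wired. */
  static String getText(Node root, boolean contentWithTitle) {
    Set<String> excluded = contentWithTitle
        ? Collections.<String>emptySet()
        : Collections.singleton("title");
    StringBuilder sb = new StringBuilder();
    collectText(root, excluded, sb);
    return sb.toString();
  }

  /** Parses a well-formed XHTML string (assumption: no tag soup here). */
  static Document parse(String xhtml) throws Exception {
    return DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
  }

  public static void main(String[] args) throws Exception {
    Document doc = parse("<html><head><title>My Title</title></head>"
        + "<body><p>Body text</p></body></html>");
    System.out.println(getText(doc, true));  // title kept in content
    System.out.println(getText(doc, false)); // title excluded from content
  }
}
```

A {{getTitle}}-like caller would simply pass an empty set, so the special case 
{{currentNode != node && ...}} would no longer be needed in the helper.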


> Title duplicated in document body
> ---------------------------------
>
>                 Key: NUTCH-1749
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1749
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Greg Padiasek
>         Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts document title into document content. Since 
> the title alone can be retrieved via DOMContentUtils.getTitle() and content 
> is retrieved via DOMContentUtils.getText(), there is no need to duplicate 
> title in the content. When title is included in the content it becomes 
> difficult/impossible to extract document body without title. A need to 
> extract document body without title is visible when user wants to index or 
> display body and title separately.
> Attached is a patch which prevents including title in document content in the 
> HTML parser plugin.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
