[jira] [Updated] (NUTCH-1749) Optionally exclude title from content field

Sebastian Nagel (JIRA) Sun, 17 Dec 2017 02:57:39 -0800

     [ 
https://issues.apache.org/jira/browse/NUTCH-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sebastian Nagel updated NUTCH-1749:
-----------------------------------
    Fix Version/s:     (was: 1.14)
                   1.15

> Optionally exclude title from content field
> -------------------------------------------
>
>                 Key: NUTCH-1749
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1749
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.7
>            Reporter: Greg Padiasek
>             Fix For: 1.15
>
>         Attachments: DOMContentUtils.patch
>
>
> The HTML parser plugin inserts the document title into the document content. 
> Since the title alone can be retrieved via DOMContentUtils.getTitle() and the 
> content via DOMContentUtils.getText(), there is no need to duplicate the title 
> in the content. When the title is included in the content, it becomes 
> difficult or impossible to extract the document body without it. The need to 
> extract the body without the title arises when a user wants to index or 
> display body and title separately.
> Attached is a patch that prevents the HTML parser plugin from including the 
> title in the document content.
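
To illustrate the idea in the description, here is a minimal sketch of text extraction that can optionally skip the title element. It is not the attached patch and does not use Nutch's DOMContentUtils; the class name TitleExclusionSketch, the includeTitle flag, and the parse helper are hypothetical names chosen for illustration, and only the standard org.w3c.dom API is assumed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical illustration only; not the code from DOMContentUtils.patch.
public class TitleExclusionSketch {

    // Collects the text nodes under root, separated by single spaces.
    // When includeTitle is false, any <title> subtree is skipped.
    static String getText(Node root, boolean includeTitle) {
        StringBuilder sb = new StringBuilder();
        collect(root, includeTitle, sb);
        return sb.toString().trim();
    }

    private static void collect(Node node, boolean includeTitle, StringBuilder sb) {
        // Skip the whole <title> element when the caller asked to exclude it.
        if (!includeTitle
                && node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return;
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        // Depth-first walk over the children in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, includeTitle, sb);
        }
    }

    // Parses a well-formed XHTML string into a DOM document (assumption: input is valid XML).
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><title>My Title</title></head>"
                + "<body><p>Body text</p></body></html>");
        System.out.println(getText(doc, true));   // title and body together
        System.out.println(getText(doc, false));  // body only
    }
}
```

With the flag set to false, the body can be indexed or displayed on its own while the title is obtained separately, which is the use case the issue describes.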



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
