[ https://issues.apache.org/jira/browse/NUTCH-750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-750:
--------------------------------

    Component/s: parser

> HtmlParser plugin - page title extraction
> -----------------------------------------
>
>                 Key: NUTCH-750
>                 URL: https://issues.apache.org/jira/browse/NUTCH-750
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.0.0
>            Reporter: Alexey Torochkov
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: SkipBody.patch
>
>
> A small improvement: try to extract the <title> tag from the body if it does
> not exist in the head.
> In the current version, DOMContentUtils skips everything after <body> in the
> getTitle() method.
> The attached patch makes this behavior configurable (by default nothing
> changes) and lets the parser cope with webmasters' mistakes.
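
The behavior described in the issue can be illustrated with a minimal sketch. This is not the code from SkipBody.patch or from Nutch's DOMContentUtils; the class TitleSketch, the methods findTitle and parse, and the skipBody flag are hypothetical names chosen for illustration, and the sketch uses only the JDK's built-in DOM parser on well-formed markup. As a simplification, it skips the <body> subtree when the flag is set rather than stopping the whole walk.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not the Nutch implementation.
class TitleSketch {

    /**
     * Depth-first search for the first <title> element.
     * When skipBody is true, the <body> subtree is not searched,
     * so a misplaced <title> inside <body> is ignored.
     */
    static String findTitle(Node node, boolean skipBody) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            if (skipBody && "body".equalsIgnoreCase(name)) {
                return null; // do not descend into <body>
            }
            if ("title".equalsIgnoreCase(name)) {
                return node.getTextContent().trim();
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            String title = findTitle(child, skipBody);
            if (title != null) {
                return title;
            }
        }
        return null;
    }

    /** Parses well-formed (X)HTML with the JDK's XML parser. */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }
}
```

With a page whose <title> was mistakenly placed inside <body>, calling findTitle with skipBody set to true finds nothing, while setting it to false recovers the title. Keeping the flag's default at true would correspond to the "by default nothing changes" note in the issue.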