[ https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2433.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.14
Thanks, committed to 1.x,
[777e759|https://github.com/apache/nutch/commit/777e759ada24eac84072a5f1722938442432eadc].
> Html Parser: keep htmltag where the outlinks are found
> ------------------------------------------------------
>
> Key: NUTCH-2433
> URL: https://issues.apache.org/jira/browse/NUTCH-2433
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.13
> Environment: Apache Nutch release 1.13.
> Reporter: Marcos Bori
> Labels: html, outlink
> Fix For: 1.14
>
>
> When parsing HTML pages, I need to know in which HTML tag each outlink was
> found (for example, 'a', 'script', 'img', etc.).
> I propose adding a new configuration property,
> "parser.html.outlinks.htmlnode_metadata_name".
> If this property is not empty, every outlink found is assigned a metadata
> entry whose key is the value of this property and whose value is the name
> of the HTML tag where the outlink was found.
> I will now send the pull request with my code implementation.
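
The behaviour proposed above can be illustrated with a minimal, standalone sketch. It is not the committed patch and does not use Nutch classes: `OutlinkTagSketch`, `Link`, and `makeLink` are hypothetical names invented for illustration, and `"htmltag"` is just an example value for the "parser.html.outlinks.htmlnode_metadata_name" property.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutlinkTagSketch {

  /** Hypothetical stand-in for an outlink: a target URL plus string metadata. */
  public static final class Link {
    public final String url;
    public final Map<String, String> metadata = new LinkedHashMap<>();

    public Link(String url) {
      this.url = url;
    }
  }

  /**
   * Creates a link for the given URL. If metadataName (the configured
   * property value) is non-empty, the name of the HTML tag in which the
   * link was found is stored under that key; otherwise no metadata is added.
   */
  public static Link makeLink(String url, String tagName, String metadataName) {
    Link link = new Link(url);
    if (metadataName != null && !metadataName.trim().isEmpty()) {
      link.metadata.put(metadataName, tagName);
    }
    return link;
  }

  public static void main(String[] args) {
    // Property set: the tag name is recorded under the configured key.
    Link tagged = makeLink("https://example.org/app.js", "script", "htmltag");
    System.out.println(tagged.url + " " + tagged.metadata);

    // Property empty: the link carries no extra metadata.
    Link plain = makeLink("https://example.org/page.html", "a", "");
    System.out.println(plain.url + " " + plain.metadata);
  }
}
```

Leaving the property empty keeps the previous behaviour in this sketch, since no metadata entry is written at all.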