Marcos Bori created NUTCH-2433:
----------------------------------
Summary: Html Parser: keep htmltag where the outlinks are found
Key: NUTCH-2433
URL: https://issues.apache.org/jira/browse/NUTCH-2433
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.13
Environment: Apache Nutch release 1.13.
Reporter: Marcos Bori
When parsing HTML pages, I need to know in which HTML tag each outlink was
found (for example, 'a', 'script', or 'img').
I propose adding a new configuration property,
"parser.html.outlinks.htmlnode_metadata_name".
If this property is not empty, every outlink found is assigned a metadata
entry whose key is the value of this property and whose value is the name of
the HTML tag in which the outlink was found.
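As a rough illustration of the intended usage, the property could be set in
conf/nutch-site.xml as sketched below. The value "htmlnode" is only an example
name chosen here, not a default, and the description text is my own wording.

```xml
<!-- Sketch only: "htmlnode" is an arbitrary example value, not a default. -->
<property>
  <name>parser.html.outlinks.htmlnode_metadata_name</name>
  <value>htmlnode</value>
  <description>
    Name of the outlink metadata entry that stores the HTML tag in which the
    outlink was found. Leave empty to disable.
  </description>
</property>
```

With this setting, an outlink taken from an 'img' tag would be expected to
carry the metadata entry htmlnode=img, and one taken from an 'a' tag would
carry htmlnode=a.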
I will now open a pull request with my implementation.