Marcos Bori created NUTCH-2433:
----------------------------------

             Summary: Html Parser: keep htmltag where the outlinks are found
                 Key: NUTCH-2433
                 URL: https://issues.apache.org/jira/browse/NUTCH-2433
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.13
         Environment: Apache Nutch release 1.13.
            Reporter: Marcos Bori


When parsing HTML pages, I need to know which HTML tag each outlink was 
found in (for example, 'a', 'script', or 'img').

I propose adding a new configuration property, 
"parser.html.outlinks.htmlnode_metadata_name".
If this property is not empty, every outlink found will be assigned a 
metadata entry whose name is the value of this property and whose value is 
the name of the HTML tag where the outlink was found.
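
To illustrate the proposed behavior, here is a minimal, self-contained sketch. 
It is not the code from the pull request and does not use the Nutch classes: 
TaggedOutlink, tagOutlink, and the property value "htmltag" are hypothetical 
names chosen for this example, and lower-casing the tag name is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class OutlinkTagSketch {

  /** Hypothetical stand-in for an outlink: a target URL plus string metadata. */
  static final class TaggedOutlink {
    final String toUrl;
    final Map<String, String> metadata = new LinkedHashMap<>();

    TaggedOutlink(String toUrl) {
      this.toUrl = toUrl;
    }
  }

  /**
   * Builds an outlink and, if metadataName is not empty, records the HTML tag
   * it was found in under that metadata name.
   *
   * @param metadataName the value that would be configured for
   *        parser.html.outlinks.htmlnode_metadata_name (may be null or empty)
   */
  static TaggedOutlink tagOutlink(String toUrl, String htmlTag, String metadataName) {
    TaggedOutlink link = new TaggedOutlink(toUrl);
    if (metadataName != null && !metadataName.trim().isEmpty()) {
      // Assumption: tag names are normalized to lower case ('a', 'img', ...).
      link.metadata.put(metadataName.trim(), htmlTag.toLowerCase(Locale.ROOT));
    }
    return link;
  }

  public static void main(String[] args) {
    // Property set: the tag name is stored under the configured metadata name.
    TaggedOutlink img = tagOutlink("http://example.com/logo.png", "IMG", "htmltag");
    System.out.println(img.toUrl + " " + img.metadata);

    // Property empty: the outlink carries no extra metadata.
    TaggedOutlink plain = tagOutlink("http://example.com/", "a", "");
    System.out.println(plain.toUrl + " " + plain.metadata);
  }
}
```

With the property left empty the outlinks are unchanged, so the feature would 
be opt-in and existing configurations would behave as before.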

I will now send a pull request with my implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)