[ https://issues.apache.org/jira/browse/NUTCH-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987520#comment-13987520 ]
Julien Nioche commented on NUTCH-1622:
--------------------------------------
Hi Daniel
Sorry for not commenting on your patch earlier; I hadn't seen it. We need a more
generic mechanism than the HTMLParser for this, as it would potentially have to
be done for every flavour of parser that exists (e.g. the Tika one).
Nutch 2.x does not have a ParseOutputFormat. Wouldn't it be better to write the
metadata for the outlinks as part of the DbUpdate* code?
> Create Outlinks with metadata
> -----------------------------
>
> Key: NUTCH-1622
> URL: https://issues.apache.org/jira/browse/NUTCH-1622
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.7, 2.2.1
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.8, 2.4
>
> Attachments: NUTCH-1622-2.x.patch, NUTCH-1622.patch
>
>
> Having the possibility to specify metadata when creating an outlink is
> extremely useful, as it allows information to be passed from a source page to
> the pages it links to. We use this routinely within our custom parsers in
> combination with the url-meta plugin.
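
To make the idea in the description concrete, here is a minimal sketch of carrying metadata from a source page onto the links it produces. It deliberately avoids the Nutch classes: `LinkWithMetadata` and `createOutlinks` are hypothetical names invented for this illustration, and the `category` key is only an example value, not something taken from the attached patches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OutlinkMetadataSketch {

    /** Hypothetical stand-in for an outlink; this is NOT the Nutch Outlink class. */
    static final class LinkWithMetadata {
        final String toUrl;
        final String anchor;
        final Map<String, String> metadata;

        LinkWithMetadata(String toUrl, String anchor, Map<String, String> metadata) {
            this.toUrl = toUrl;
            this.anchor = anchor;
            // Defensive copy: later changes to the caller's map must not leak into the link.
            this.metadata = Collections.unmodifiableMap(new LinkedHashMap<>(metadata));
        }
    }

    /** Attach the source page's metadata to every link discovered on that page. */
    static List<LinkWithMetadata> createOutlinks(List<String> targetUrls,
                                                 Map<String, String> sourceMetadata) {
        List<LinkWithMetadata> links = new ArrayList<>();
        for (String url : targetUrls) {
            // The anchor text is left empty here; a real parser would fill it in.
            links.add(new LinkWithMetadata(url, "", sourceMetadata));
        }
        return links;
    }

    public static void main(String[] args) {
        // Example metadata known for the source page (assumed key and value).
        Map<String, String> sourceMetadata = new LinkedHashMap<>();
        sourceMetadata.put("category", "news");

        List<LinkWithMetadata> links = createOutlinks(
                Arrays.asList("http://example.com/a", "http://example.com/b"),
                sourceMetadata);

        for (LinkWithMetadata link : links) {
            System.out.println(link.toUrl + " " + link.metadata);
        }
    }
}
```

Because the metadata travels with the link object rather than with any particular parser, the same structure could in principle be filled in by an HTML parser, a Tika-based parser, or a custom one.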